When you're the support engineer, the analyst, and the firefighter all in one person.

## The Context: Big Data, Real Clients, Real Problems

I work in a data platform team supporting clients. Our stack includes Snowflake and BigQuery, and the product exposes millions of data points daily: customer metrics, store-level analytics, historical trends, etc.

So, when someone said, “Why is my dashboard empty?”, it wasn’t just a UI bug. Something was broken deep in the data pipeline.

Spoiler: entire chunks of data never made it into our platform.

## Step 1: Investigating the Black Hole

The first clue came from a product delivery team: some accounts had incomplete datasets. We quickly traced the flow:

- Raw data entered MySQL
- It was meant to move to Google Cloud Storage (GCS)
- But it didn’t make it into Snowflake

No alarms. No alerts. Just... silent failure.

The parser in the GCS → Snowflake handoff was silently skipping records due to a malformed schema and a lack of validation.

## Step 2: Build a Monitoring System (In a Week)

I decided to build a simple but powerful monitoring pipeline using Python and Snowflake. No waiting for a long-term product fix: this had to work now.

### Step 2.1: Connect to Snowflake via Python

```python
import snowflake.connector
import pandas as pd

# Connect to Snowflake (placeholder credentials)
conn = snowflake.connector.connect(
    user='user',
    password='password',
    account='account_id',
    warehouse='warehouse',
    database='db',
    schema='schema'
)

# Just an example: latest event date per customer and module
query = """
SELECT customer_id, module, max(event_date) as last_date
FROM analytics_table
GROUP BY customer_id, module
"""
df = pd.read_sql(query, conn)
```

### Step 2.2: Detect Delays with Pandas

```python
# Anything more than a day behind gets flagged as delayed
df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")
```

Now I had a clear view of which modules were fresh and which were weeks behind.

### Step 2.3: Store the Results in a New Monitoring Table

```python
from sqlalchemy import create_engine

# Example connection string (requires the snowflake-sqlalchemy package)
engine = create_engine(
    'snowflake://user:password@account/db/schema?warehouse=warehouse'
)
df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)
```

## Step 3: Show the Pain Visually

Using Looker Studio, I connected directly to the monitoring table. I built a dashboard that showed:

- % of modules delayed per customer
- Time since last successful sync
- New issues as they appeared
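Looker Studio handles the visuals, but each of those tiles is just an aggregate over the monitoring table. As a rough sketch (reusing the `conn` object from Step 2.1 and the `monitoring_status` columns from Step 2.3; the aliases and ordering are illustrative, not the exact query the dashboard runs), the per-customer delay view boils down to something like this:

```python
# Sketch of the aggregate behind the "% of modules delayed per customer" tile.
# Reuses the Snowflake connection from Step 2.1 and the monitoring_status
# table from Step 2.3; column aliases and ordering are illustrative.
summary = pd.read_sql(
    """
    SELECT
        customer_id,
        COUNT_IF(status = 'delayed') / COUNT(*) * 100 AS pct_delayed,
        MAX(lag_days) AS worst_lag_days
    FROM monitoring_status
    GROUP BY customer_id
    ORDER BY pct_delayed DESC
    """,
    conn,
)
```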
The dashboard made leadership see the gap and prioritize resources.

## Step 4: Fix the Root Cause

Meanwhile, we fixed the broken parser and restarted all affected pipelines. We also retroactively reprocessed all the lost data, and thanks to the monitoring table we could verify that the recovery was complete.

## Outcomes

- Recovered 100% of the lost data
- Created a real-time, always-on monitoring dashboard
- Reduced future blind spots by enabling automatic validation
- Prevented client escalations by catching issues before clients noticed

## Final Thoughts

Sometimes, all you need is a smart support engineer, a Python script, and a clear view of what’s wrong.

If your data platform has millions of moving parts, assume something is broken. Then go build something to catch it.
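If you want a place to start, the “automatic validation” above doesn’t need to be fancy. A scheduled script along these lines (a minimal sketch assuming the `monitoring_status` table and the connection from Step 2; how you alert is up to you) is enough to catch the next silent failure:

```python
# Minimal validation check: fail loudly if any module is behind.
# Assumes the Snowflake connection from Step 2.1 and the monitoring_status
# table from Step 2.3; replace the raise with your alerting of choice.
delayed = pd.read_sql(
    "SELECT customer_id, module, lag_days FROM monitoring_status WHERE status = 'delayed'",
    conn,
)

if not delayed.empty:
    raise RuntimeError(
        f"{len(delayed)} delayed module(s) detected:\n{delayed.to_string(index=False)}"
    )
```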