When you're the support engineer, the analyst, and the firefighter all in one person.

## The Context: Big Data, Real Clients, Real Problems

I work in a data platform team supporting clients. Our stack includes Snowflake and BigQuery, and the product exposes millions of data points daily: customer metrics, store-level analytics, historical trends, etc.

So, when someone said, “Why is my dashboard empty?”, it wasn’t just a UI bug. Something was broken deep in the data pipeline.

Spoiler: entire chunks of data never made it into our platform.

## Step 1: Investigating the Black Hole

The first clue came from a product delivery team: some accounts had incomplete datasets. We quickly traced the flow:

- Raw data entered MySQL
- It was meant to move to Google Cloud Storage (GCS)
- But it didn’t make it into Snowflake

No alarms. No alerts. Just... silent failure.

The parser in the GCS → Snowflake handoff was silently skipping records due to a malformed schema and a lack of validation.

## Step 2: Build a Monitoring System (In a Week)

I decided to build a simple but powerful monitoring pipeline using Python and Snowflake. No waiting for a long-term product fix: this had to work now.

### Step 2.1: Connect to Snowflake via Python

```python
import snowflake.connector
import pandas as pd

# Connect to Snowflake (placeholder credentials)
conn = snowflake.connector.connect(
    user='user',
    password='password',
    account='account_id',
    warehouse='warehouse',
    database='db',
    schema='schema'
)

# Just an example: latest event date per customer and module
query = """
SELECT customer_id, module, max(event_date) as last_date
FROM analytics_table
GROUP BY customer_id, module
"""
df = pd.read_sql(query, conn)
```

### Step 2.2: Detect Delays with Pandas

```python
# Anything more than a day behind gets flagged as delayed
df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")
```

Now I had a clear view of which modules were fresh and which were weeks behind.

### Step 2.3: Store the Results in a New Monitoring Table

```python
from sqlalchemy import create_engine

# Example connection string (requires the snowflake-sqlalchemy package)
engine = create_engine(
    'snowflake://user:password@account/db/schema?warehouse=warehouse'
)
df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)
```

## Step 3: Show the Pain Visually

Using Looker Studio, I connected directly to the monitoring table. I built a dashboard that showed:

- % of modules delayed per customer
- Time since last successful sync
- New issues as they appeared
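Looker Studio handles the visuals, but each of those tiles is just an aggregate over the monitoring table. As a rough sketch (reusing the `conn` object from Step 2.1 and the `monitoring_status` columns from Step 2.3; the aliases and ordering are illustrative, not the exact query the dashboard runs), the per-customer delay view boils down to something like this:

```python
# Sketch of the aggregate behind the "% of modules delayed per customer" tile.
# Reuses the Snowflake connection from Step 2.1 and the monitoring_status
# table from Step 2.3; column aliases and ordering are illustrative.
summary = pd.read_sql(
    """
    SELECT
        customer_id,
        COUNT_IF(status = 'delayed') / COUNT(*) * 100 AS pct_delayed,
        MAX(lag_days) AS worst_lag_days
    FROM monitoring_status
    GROUP BY customer_id
    ORDER BY pct_delayed DESC
    """,
    conn,
)
```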
The dashboard made leadership see the gap and prioritize resources.

## Step 4: Fix the Root Cause

Meanwhile, we fixed the broken parser and restarted all affected pipelines. We also retroactively reprocessed all the lost data, and thanks to the monitoring table we could verify that the recovery was complete.

## Outcomes

- Recovered 100% of the lost data
- Created a real-time, always-on monitoring dashboard
- Reduced future blind spots by enabling automatic validation
- Prevented client escalations by catching issues before clients noticed

## Final Thoughts

Sometimes, all you need is a smart support engineer, a Python script, and a clear view of what’s wrong.

If your data platform has millions of moving parts, assume something is broken. Then go build something to catch it.
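If you want a place to start, the “automatic validation” above doesn’t need to be fancy. A scheduled script along these lines (a minimal sketch assuming the `monitoring_status` table and the connection from Step 2; how you alert is up to you) is enough to catch the next silent failure:

```python
# Minimal validation check: fail loudly if any module is behind.
# Assumes the Snowflake connection from Step 2.1 and the monitoring_status
# table from Step 2.3; replace the raise with your alerting of choice.
delayed = pd.read_sql(
    "SELECT customer_id, module, lag_days FROM monitoring_status WHERE status = 'delayed'",
    conn,
)

if not delayed.empty:
    raise RuntimeError(
        f"{len(delayed)} delayed module(s) detected:\n{delayed.to_string(index=False)}"
    )
```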