Businesses tend to want to use data in exciting, high-risk-high-reward ways. They want the technology and automation that can process and analyze higher volumes of data.
But as they implement new tools, there’s risk involved. Specifically, the risk of end users noticing or being inconvenienced when the data pipeline changes. Ideally, new data sources could be onboarded, new transformations written, existing transformations extended, and useless data end-of-life’d, all without end users giving it a second thought.
More often though, changes in the data pipeline break something that impacts users. With blind spots in the data model, businesses can’t predict how a change to the data, or to its downstream dependencies, will ripple out to users.
Data observability can create visibility into the data model, and enable data teams to predict how changes in one part of the model will cascade downstream, plus trace problems to their upstream sources.
Companies like Uber, Airbnb, Netflix, and others use data observability to help cover data blind spots. Essentially, any business that wants to evolve its data pipeline without breaking things for end users should consider a data observability program.
Data observability tools — like their DevOps counterparts — help data engineering teams achieve that goal with continuous information about the data model’s behavior.
With data observability, engineers can see inside the tables in their data stores: when they’re being written to or read from, where their data comes from, and where it goes. In other words, data observability tools give a bird’s eye view of the entire data model. They supply both the foresight to predict and the early-warning capabilities to detect changes to the data (e.g. a schema change) or to the infrastructure (e.g. reducing how often an ELT job runs) that impact users.
Ultimately, data observability is built on three core blocks that create context for data engineers: metrics, logs, and lineage.
Metrics are numbers that come from direct queries against the dataset: a table’s row count, the average value of a numeric column, or the number of duplicate user_uuids. Metrics are familiar to anyone who has run a SQL query; they’re what comes back from any query with aggregated results. In an observability context, a metric quantifies what’s happening inside the dataset, instead of answering a business question.
Metrics answer questions about the internal state of each table across the data model like:
Count of duplicate user_uuids: are we recording duplicate user records?
Percent valid formatted user_emails: are we recording valid email addresses for our users?
Skew of transaction_amount: did the distribution of our payments suddenly shift?
Metrics describe the behavior inside the table. Collecting metrics requires an understanding of the table under observation. Each table needs a unique set of metrics to accurately describe its behavior. This can be a big barrier to instrumentation (imagine hand-picking the right set of metrics for 100+ tables with 50–75 columns each). Data observability tools seek to automate this process with techniques like data profiling, and/or their own secret sauce of picking the right metrics.
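As a rough illustration of what those checks look like in practice, here’s a minimal sketch that computes the three metric examples above against a toy in-memory database. The table and column names (users.user_uuid, users.user_email, transactions.transaction_amount) and the email regex are assumptions made for the example; a real observability tool would generate and schedule checks like these automatically rather than having engineers hand-code them.

```python
import re
import sqlite3
import statistics

# Toy in-memory database standing in for a warehouse table.
# Table and column names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_uuid TEXT, user_email TEXT);
    INSERT INTO users VALUES
        ('u1', 'a@example.com'),
        ('u1', 'a@example.com'),  -- duplicate record
        ('u2', 'not-an-email');   -- malformed email
    CREATE TABLE transactions (transaction_amount REAL);
    INSERT INTO transactions VALUES (10.0), (12.5), (11.0), (250.0);
""")

# Metric 1: count of duplicate user_uuids.
dupes = conn.execute("""
    SELECT COALESCE(SUM(n - 1), 0) FROM (
        SELECT COUNT(*) AS n FROM users GROUP BY user_uuid HAVING COUNT(*) > 1
    )
""").fetchone()[0]

# Metric 2: percent of user_emails that look valid (crude regex check).
emails = [row[0] for row in conn.execute("SELECT user_email FROM users")]
valid = sum(bool(re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", e)) for e in emails)
pct_valid = 100.0 * valid / len(emails)

# Metric 3: skew of transaction_amount (Fisher-Pearson coefficient).
amounts = [row[0] for row in conn.execute("SELECT transaction_amount FROM transactions")]
mean, stdev = statistics.mean(amounts), statistics.pstdev(amounts)
skew = sum(((x - mean) / stdev) ** 3 for x in amounts) / len(amounts)

print(f"duplicate user_uuids: {dupes}")
print(f"% valid user_emails:  {pct_valid:.1f}")
print(f"transaction_amount skew: {skew:.2f}")
```

An observability platform would run queries like these on a schedule, store the results as time series, and alert when a metric drifts outside its expected range.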
Metadata is information about the physical data or the processes around it (ELT jobs, for example), not the data itself. Metadata could be a log of the queries run against a given Snowflake table, or the logs produced by an Airflow job. Metadata won’t tell you what’s in the data; it tells you what’s being done to the data by the infrastructure, for example an INSERT appending new rows to a table, or a failed Airflow job that never wrote anything to its intended destination.
Metadata answers questions about what’s happening TO the data, rather than what’s IN the data, like:
How long has it been since this table was written to?
How many rows were inserted when that happened?
How long did this ELT job take to run?
Collecting metadata is a bit simpler than collecting metrics (which are configured uniquely for every table) or lineage (which is merged together from multiple sources). Tools like Snowflake make metadata queryable like any other table, and Fivetran dumps metadata into the destination schema where it can be queried the same way. From there, the logged queries are parsed for relevant statistics so that time-since-write, rows-inserted, job-run-duration, and so on can be tracked over time.
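A minimal sketch of that parsing step is below, assuming the metadata has already been exported as a list of query-log records. The record fields here (query_text, target_table, rows_inserted, and the timestamps) are hypothetical; the real columns depend on the warehouse or pipeline tool.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical query-log record; real field names depend on the source
# (a warehouse's query history, an orchestrator's job logs, etc.).
@dataclass
class QueryLogRecord:
    query_text: str
    target_table: str
    rows_inserted: int
    started_at: datetime
    finished_at: datetime

def summarize(records: list[QueryLogRecord], now: datetime) -> dict[str, dict]:
    """Roll query-log metadata up into per-table freshness and volume stats."""
    stats: dict[str, dict] = {}
    for rec in records:
        # Only write queries affect freshness; skip reads.
        if not rec.query_text.lstrip().upper().startswith(("INSERT", "COPY", "MERGE")):
            continue
        table = stats.setdefault(rec.target_table, {"last_write": rec.finished_at})
        if rec.finished_at >= table["last_write"]:
            table["last_write"] = rec.finished_at
            table["rows_inserted_last_run"] = rec.rows_inserted
            table["last_run_duration"] = rec.finished_at - rec.started_at
    # Derive time-since-write for each table.
    for table in stats.values():
        table["time_since_write"] = now - table["last_write"]
    return stats

# Example: one nightly load into a hypothetical orders table.
log = [QueryLogRecord("INSERT INTO orders SELECT ...", "orders", 5_000,
                      datetime(2022, 1, 1, 2, 0), datetime(2022, 1, 1, 2, 7))]
print(summarize(log, now=datetime(2022, 1, 1, 12, 0)))
```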
The DataOps space has settled on the term “lineage” where the DevOps space uses “traces.” Lineage refers to the path the data takes from creation, through any databases and transformation jobs, all the way down to final destinations like analytics dashboards or ML feature stores.
Often, lineage is constructed by parsing logs. However, it stands on its own as a concept because of the role it plays in behavioral analysis of the overall data model.
Lineage helps to answer questions about where something happened, or where something is going to end up having an impact. For example:
If I change the schema of this table, what other tables will start having problems?
If I see a problem in this table, how do I know whether it flowed here due to a problem elsewhere?
If I have an accuracy issue in this table, who’s looking at the dashboards that ultimately depend on it?
Lineage is most often collected by parsing the logs of queries that write into each table. By modeling what’s happening inside each query, you discover which tables are being read from and which are being written to.
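A toy version of that log-parsing step might look like the sketch below. It uses a crude regular expression to pull source and destination tables out of INSERT INTO ... SELECT statements, accumulates them into a graph, and walks the graph to find downstream impact. The logged queries and table names are made up for the example; production lineage tools use full SQL parsers and handle far more query shapes.

```python
import re
from collections import defaultdict

# Crude patterns for illustration only; real lineage extraction uses a proper
# SQL parser to handle CTEs, subqueries, MERGE statements, and so on.
WRITE_RE = re.compile(r"INSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
READ_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_edges(query: str) -> list[tuple[str, str]]:
    """Return (source_table, destination_table) pairs found in one query."""
    writes = WRITE_RE.findall(query)
    reads = READ_RE.findall(query)
    return [(src, dst) for dst in writes for src in reads]

def build_lineage(query_log: list[str]) -> dict[str, set[str]]:
    """Merge edges from many logged queries into a downstream-lineage graph."""
    downstream: dict[str, set[str]] = defaultdict(set)
    for query in query_log:
        for src, dst in extract_edges(query):
            downstream[src].add(dst)
    return downstream

def impacted_tables(graph: dict[str, set[str]], start: str) -> set[str]:
    """Walk the graph to find everything downstream of a changed table."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical logged queries from an ELT job.
log = [
    "INSERT INTO analytics.daily_revenue SELECT * FROM raw.transactions",
    "INSERT INTO dashboards.revenue_widget SELECT * FROM analytics.daily_revenue",
]
graph = build_lineage(log)
print(impacted_tables(graph, "raw.transactions"))
# {'analytics.daily_revenue', 'dashboards.revenue_widget'} (order may vary)
```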
Lineage can (and should) go further than just table-to-table or column-to-column relationships. Companies like AirBnB and Uber have been modeling lineage all the way upstream to the source database or Kafka topic, and all the way downstream to the user level. In doing so, they can communicate data problems or changes all the way up to the relevant humans.
When you merge metrics, logs, and lineage, you give the data operator crucial information: what’s going on INSIDE the tables and whether that’s changing over time, what’s happening TO the tables via queries and ELT jobs, and what the relationships BETWEEN the tables are and how problems will flow across the graph.
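As a hand-wavy sketch of how those three sources might be stitched together, the snippet below assembles an alert for an anomalous metric by combining the kinds of outputs the earlier sketches produce. All of the inputs and field names here are hypothetical.

```python
def build_alert(table: str,
                metric_name: str,
                metric_value: float,
                table_freshness: dict,
                downstream_tables: set[str]) -> dict:
    """Combine metric, log-derived, and lineage-derived context into one alert."""
    return {
        "table": table,
        # INSIDE the table: the metric that tripped the check.
        "metric": {metric_name: metric_value},
        # TO the table: freshness/volume stats derived from query logs.
        "freshness": table_freshness,
        # BETWEEN tables: downstream dependencies that may be affected.
        "downstream_impact": sorted(downstream_tables),
    }

print(build_alert(
    table="raw.transactions",
    metric_name="skew(transaction_amount)",
    metric_value=1.4,
    table_freshness={"time_since_write_hours": 10, "rows_inserted_last_run": 5000},
    downstream_tables={"analytics.daily_revenue", "dashboards.revenue_widget"},
))
```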
Other aspects have to be built correctly in order for a data observability tool to be useful, but these three building blocks are the primary sources of information enabling everything else.
With observability, data operators use the information from metrics, logs, and lineage to understand the state of their complete data model, ship improvements faster, and deliver more reliable data to their end users. As data engineering evolves, data observability and its nomenclature will likely go more mainstream, as its cousin from DevOps already has.