I can't say how many times I've asked myself these questions, or how many times I heard those when I talk with data engineers, analytics engineers or heads of data. The answer never was rally exciting. It was something like: "well, check out the source code and map dependencies in an excel" or "ask Brian, he knows". Obviously, that poor Brian was an incredible data engineer, but the reason he was so good was that he wasn't spending his time answering to my relentless pings on Slack.
Spoiler Alert: that's why people invented data lineage !
Data lineage is a technology that retraces the relationships between data assets. In the data world, you start by collecting raw data from various sources (logs from your website, payments, etc) and refine this data by applying successive transformations. In order to build one data table (we will call it "child table"), you have to use one or more other data tables (we call these "parent tables").
The data lineage helps you rebuild the family tree of your data. If you have bad data, you can look for the 'bad branch' in that data family tree and cure it to the roots.
In the graph below, each card represents a data resources like a table, a dashboard or a report. Each link represents a transformation or a data job (orchestrated by Airflow for example).
'Data lineage is like a family tree but for data'
Data lineage flow can exist under different format. You can either trace parent-child relationships between data tables or columns in a specific table. Indeed, each column is created with other columns from other tables, sometimes even from the same table.
Column lineage is a subset of table lineage. If there is a link between two columns from different tables, then there is a link between the two tables. The opposite isn't always true. Column lineage is providing the parent-child relationship at a column-level without taking into account the tables underneath. It is a really powerful kind of lineage but can quickly be extremely hard to visualize as the number of connections grows. Column-level lineage system is also particularly hard to compute. Processing column-lineage more expensive, takes longer and requires a finer version of the parsing algorithm to understand the changes.
Example of what column-lineage flow looks like in Castor.
In the SELECT SQL statement below, we are building a table child_table_1 with two other tables parent_table_1 (FROM clause) and parent_table_2 (JOIN clause). That means that if there is a change or a problem in one of the two "parent tables", the child table might be affected by the change or the bug.
Example of SQL query that can be parsed to find parent and column-level lineage links.
In the SELECT SQL statement above, we are building three columns child_1, child_2, and child_3 in the child_table_1. The two first "child columns" are the same as the "parent columns" from parent_table_1 and parent_table_2. There is a direct lineage dependence.
The child_3 column is a sum of parent_table_1.parent_column1 and parent_table_1.parent_column1. There's a lineage dependence here as well.
"Data lineage gives a bird's eye view of a data flow."
It helps trace back relationships across data systems
A few years ago, lineage was mostly about tracing relationships in the data lake or the data warehouse. This means a system tracking relationships between data tables. Now data lineage is cross-system.
Those tables are often powering BI tools such as Looker, Tableau, Mode, Metabase, PowerBI etc. Mapping this flow proved really powerful for business users as well as engineers. Indeed, business users can fact-check the provenance and flow of the data easily while engineers can check the impact of a change on the business.
End-to-end data lineage system from the data sources to the BI tools.
End-to-end lineage, from data sources (ETL) to business reports, is possible now that most BI tools have APIs, and that BI tools perform SQL queries on the data warehouse in a standardized way. When data catalog providers had to build as many versions of their lineage parser for each new tool, language and integration, they can now build a lineage engine and wait customize the last mile connector.
First of all, data lineage is a technology more than a product. It is one of the underlying technologies behind a lot of data products (data catalogs, data quality, data management, etc).
The reason data lineage is so popular is that there are a lot of new use-cases, both for business, engineering, leadership, and legal department. Let me try to explain the most common use-cases. Overall, data lineage helps get a bird's eye view of the data systems.
End-to-end data lineage is extremely powerful to troubleshoot data problems.
Data users sometimes spot weird behavior in dashboards or data tables. They want to investigate to know if this is due to a technical problem in the data pipelines or a business problem that needs attention right now. There's nothing more humiliating for a data person to sound the alarm about a business problem (for example, a drop in revenue for the "active users" segment) when it really is a data pipeline that broke.
In the example above, the data in the dashboard drops drastically. The business user wants to understand why this happened. The underlying question is: "do I have a business problem or a data pipeline problem?". He can check the status of all upstream data. In this case, the problem comes from a data source (the app store) that changed its API. This change broke the ETL job feeding data sources. In a few minutes, the user can understand the data system as a whole.
Data lineage helps you visualize the impact of a change
Data users want to know the impact of a change they want to make on the data flow, and on data resources downstream. In most organizations, data reports and dashboards are powered by data tables in the data lake. If an engineer makes a change, this will likely break data pipelines and processes. As a result, overall data quality and trust within the organization will drastically decrease.
Thanks to the lineage, engineers can identify all impacted resources and craft their code to minimize bad impact. If there is an impact, they can identify the data resources affected and send an warning to active users of those resources.
Data Lineage helps you trust your data by knowing where it comes from
Users want to identify, understand and trust data resources (tables, reports, and dashboards) that they want to consume. When a user looks at data resources, knowing where it comes from helps to trust it. A saying says "garbage in, garbage out". Knowing what is the data powering the dashboard on "active users" will give you information on how to read it and if you can trust it. Lineage will let you know that it is plugged to an "Approved by BI" table, with a Gold rating, and help you trust it faster.
Data lineage can help you propagate definitions across your data warehouse by identifying identical data points.
From a governance perspective, data lineage proved extremely powerful to propagate metadata. Users can easily propagate a definition across the data systems.
If a column is used in another table, without ongoing any transformation than it is fair to assume that in 99% of the cases the definition will be the same. Now, imagine your data catalog is connected to the column-level lineage metadata, you can be 10x more efficient in definition governance by documenting only the sources. If a popular column is used across hundreds of tables without any transformation, you can safely propagate the definition and attributes in a click.
Data lineage helps data governance teams from big data or enterprise companies deploy metadata management efforts in a scalable way. Indeed, in an environment with big data resources and numerous databases, understanding how the data flows and its provenance can be challenging. There is a lot of repetitive tasks.
Data lineage can help you tag personal information and track access rights
The logic is really similar to the logic above. Tagging personal or sensitive information can be a real pain. Today, it is a time consuming task. Thanks to the lineage you can understand and track access rights along with dependence.Often you realize, you have some weakness in your roles policy. You have don't have access to a child table or a parent table, but you have access to the middle table. This can be right, but it can also be dangerous as you can reconstruct/trace back the sensitive data.
Data lineage is really powerful to improve data governance, access rights management and comply to GDPR regulation.
Data lineage can help identify resources you need to power the most valuable data resources
Data teams sometimes want to switch technologies. For example, an enterprise company moving from an old on-premise Oracle database to a cloud data warehouse like Snowflake or BigQuery. These big migrations are really expensive as everything is connected to everything. Tracing back dependence is a business critical process before even starting such a project.
As part of this process, you want to identify data resources that you can delete versus those you want to migrate. This bird's eye view helps leading architecture thinking.
Data lineage recently gained traction. There are several reasons to that. The first one is that entreprises have more data, sooner than 10 years ago. There's an increased need for metadata management tools. The second is that the data industry structured itself with tools and processes. This makes computing lineage for cloud technologies and companies much more easy and reliable. The last reason is the emergence of harsh data regulations across the continents (GDPR, HIPAA, etc) that makes lineage a must-have.
Those three structural changes are compounding trends. They will grow drastically over time. Companies will continue to gather more data and recruit more people to analyse it. The data industry will continue to structure itself and bring standard processes. Compliance to data regulation is today more a problem of entreprise companies but will soon arrive to mid-market and SMBs as tooling will make it easier to assess it.
For all these reasons, data lineage is entering the space for good and you might want to start investing in such a tool sooner than later.
Find more benchmarks and analysis on the modern data stack here. We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data. If you're a data leader and would like to discuss these topics in more depth, join the community we've created for that!
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. Or data-wise for the Fivetran, Looker, Snowflake, DBT aficionados. We designed our catalog to be easy to use, delightful and friendly. Want to check it out? Reach out to us and we'll show you a demo.