In this post, we will consider five data quality tools and see how they can help you in your data journey:
Before we dive in with our tools, let’s first set the stage for why ensuring data quality is a business essential.
Today’s businesses are dependent on data more than ever before. According to a recent Gartner data quality market survey, poor data quality costs enterprises an average of $15 million annually. Erroneous data can result in lost business opportunities, bad market reputation, low customer confidence, and significant financial losses.
Broadly, both external and internal factors can cause data quality issues. External causes refer to things that are out of an enterprise’s control such as third-party data. Internal causes stem from problems within the organization’s ecosystem.
External factors can manifest when organizations use outside data sources. For example, different companies’ IT systems may be integrated after a merger or acquisition. Some organizations may ingest data from third-party sources like Meta, Google Analytics, or AWS Data Exchange. A failure to audit data quality sourced from third parties, or poor input validation in applications, can also cause data quality issues.
An example of an internal factor could be enterprise data silos. Data silos are little-known data sources, used by only certain teams or departments within an organization. Another factor is the lack of proper data ownership. Yet another cause can be using wrong data types and data models. On top of this, software engineers at any layer of the application may introduce code updates that change fields and break the data. Any of these factors can cause poor data quality and inconsistencies.
How data quality plays out in an organization largely depends on how that organization defines it and prioritizes it. However, there are seven useful indicators of quality.
Enterprises typically use automated tools for checking data quality. These tools can be either custom developed or vendor supplied. Both choices have their pros and cons.
In-house solutions are feasible if there are IT resources available and data quality requirements are well defined. Roll-your-own solutions can also help cut down capital expenditures. On the other hand, building a custom solution can be time-consuming, and that lost time sometimes outweighs the benefits.
Buying a solution is best if a company needs a reliable solution quickly and doesn’t want to maintain it themselves.
Now that we’ve laid the groundwork, let’s look at our five data quality tools. Of course, this is not an exhaustive list, and there are many other tools in the market. The tools you choose to mix and match for your data quality needs will depend on your use case and budget.
Great Expectations is an open-source library used for validating, documenting, and profiling data. Users define assertions in the form of expectations. An expectation is exactly what the name suggests—it’s the quality you are expecting from the data. Assertions are written in a declarative language. For example, here is a sample assertion where the column
passenger_count value has to be between 1 and 6.
Another feature is automated data profiling. Great Expectations can auto-generate expectations from the data based on its statistics. This can save time, as data quality engineers don’t have to write assertions from scratch.
Once expectations are ready, they can be incorporated into the data pipeline. For example, in Apache Airflow, the data validation step can be defined as a
checkpoint script using the
BashOperator. This will trigger the quality check as data flows through the pipeline. Great Expectations is compatible with most data sources such as CSV files, SQL databases, Spark DataFrames, and Pandas.
Spectacles is a continuous integration (CI) tool designed to validate your project’s LookML. LookML is Looker’s data modeling language. If you aren’t familiar with Looker, it’s a BI platform that allows people who don’t “speak SQL” to analyze and visualize data.
Spectacles validates LookML by running the SQL query behind it and checking for errors. It integrates with GitHub, GitLab, and Azure DevOps. The tool fits nearly any deployment pattern—whether called manually, triggered from a pull request, or run as part of an ETL job. Having Spectacles as part of a CI/CD workflow means LookML queries are automatically validated before the code is deployed into production.
Datafold is a proactive data quality platform that has three main components: Data Diff, Data Catalog with column-level lineage, and Data Monitoring.
Data Diff allows you to compare two datasets (for example, dev and prod) before merging them into production. This helps users to adopt a more proactive development strategy. Data Diff can also be integrated into a team’s CI/CD pipeline, such that diffs can show up alongside code changes in GitHub or GitLab.
As an example, let’s look at the
taxi_trips dataset that comes with Datafold’s sandbox environment. As you can see in the image below, we have run a Data Diff operation between
On the detailed panel on the right, you can select different tabs to get different perspectives on the results.
The Overview tab will contain a rundown of the successful and failed tests.
The Schema section shows whether the columns from both datasets (data type, order of appearance) match.
The Primary Keys section shows what percentage of primary keys are unique, not NULL, and matching between both datasets.
Although the Overview tab is a great source of information, other tabs provide more useful details. For example, the Schemas tab can look like this:
Here, the highlighted column is where the two datasets differ. This saves time because data engineers can concentrate on the content of those two fields.
The Data Catalog lists all the data sources registered with Datafold. It also allows users to find and profile any particular dataset using filters. This can be a huge timesaver for organizations with hundreds—or even thousands—of datasets. It is an especially useful way to uncover anomalies. Particularly, the data lineage feature can help answer questions like the following:
The Data Catalog presents a dashboard like the one shown below. As you scroll down, you will see a list of each column.
The columns provide important details at a glance:
Clicking the icon under Lineage will show something like this:
The graphical lineage diagram helps data engineers quickly find where a column’s values come from. Lineage can be checked for tables, for all or specific columns, and for upstream or downstream relationships.
Datafold’s monitoring feature allows data engineers to write SQL commands to find anomalies and create automated alerts. Alerting is backed by machine learning, which studies the trends and seasonalities in the data to accurately spot anomalies. The images below show such a query:
The query in this example is auto-generated by Datafold. It keeps track of the daily total fare and the total number of trips in the taxi dataset.
Datafold also allows users to examine the trend of anomalies over time as shown in the image below:
Here, the yellow dots represent anomalies with respect to the minimum and maximum values.
dbt is a data transformation workflow tool. It runs the data transformation code against the target database before deployment, showing how the code will affect the data and highlighting any potential problems. dbt does this by running SELECT statements to build the end state of the data based on the transformation logic.
dbt is easy to integrate into the modern BI stack and can be a valuable part of a CI/CD pipeline. It can be run automatically upon a pull request or on a schedule. It has an automatic rollback feature that stops potentially code-breaking changes from getting deployed.
Datafold and dbt can be used together for automating data quality testing. Just like dbt, Datafold can be integrated into the CI/CD pipeline. When used together, they show how your code affects your data.
Evidently is an open-source Python library that analyzes and monitors machine learning models. It generates interactive reports based on Panda DataFrames and CSV files for troubleshooting models and checking data integrity. These reports show model health, data drift, target drift, data integrity, feature analysis, and performance by segment.
To see how Evidently can help, you can open up a new notebook on Google Colab and copy the following code snippet:
!pip install evidently import pandas as pd import math from sklearn import datasets from evidently.dashboard import Dashboard from evidently.tabs import DataDriftTab wine = datasets.load_wine() wine_frame = pd.DataFrame(wine.data, columns = wine.feature_names) number_of_rows = len(wine_frame) wine_data_drift_report = Dashboard(tabs=[DataDriftTab]) wine_data_drift_report.calculate(wine_frame[:math.floor(number_of_rows/2)], wine_frame[math.floor(number_of_rows/2):], column_mapping = None) wine_data_drift_report.save("report_1.html")
This code will generate and load a report in the browser.
The report’s dashboard overview will show the distribution of references and current values based on every feature.
More close-up inspection is possible from here. For example, the graph below shows the exact values where current and reference datasets differ.
Evidently has nifty features like numerical target drift, categorical target drift, regression model performance, classification model performance, and probabilistic classification model performance.
Data quality standards and business requirements are constantly evolving, making the task of ensuring data quality a continuous journey. The introduction of bad data into your data-driven decision-making will have many consequences—all of which are detrimental to the business.
That’s why it’s necessary to maintain standards of data quality at every step in the data journey.
The five tools discussed here can be used during different phases of data processing and usage. In your own organization’s data journey, it’s worth testing out some of these tools individually—or in different combinations—to see how they can benefit your use cases.