Data quality ensures that an organization’s data is accurate, consistent, complete, and reliable. The quality of the data dictates how useful it is to the enterprise. Ensuring data quality—especially with the sheer amount of data available to today’s enterprises—is a tremendously complex beast. To do this effectively, the modern enterprise relies on data quality tools. In this post, we will consider five data quality tools and see how they can help you in your data journey: Great Expectations Spectacles Datafold dbt (Data Build Tool) Evidently Before we dive in with our tools, let’s first set the stage for ensuring data quality is a business essential. why Why Bother with Data Quality? Today’s businesses are dependent on data more than ever before. , poor data quality costs enterprises an average of $15 million annually. Erroneous data can result in lost business opportunities, bad market reputation, low customer confidence, and significant financial losses. According to a recent Gartner data quality market survey How Do We Get Bad Data? Broadly, both external and internal factors can cause data quality issues. External causes refer to things that are out of an enterprise’s control such as third-party data. Internal causes stem from problems within the organization’s ecosystem. External factors can manifest when organizations use outside data sources. For example, different companies’ IT systems may be integrated after a merger or acquisition. Some organizations may ingest data from third-party sources like Meta, Google Analytics, or AWS Data Exchange. A failure to audit data quality sourced from third parties, or poor input validation in applications, can also cause data quality issues. An example of an internal factor could be enterprise data silos. Data silos are little-known data sources, used by only certain teams or departments within an organization. Another factor is the lack of proper data ownership. Yet another cause can be using wrong data types and data models. On top of this, software engineers at any layer of the application may introduce code updates that change fields and break the data. Any of these factors can cause poor data quality and inconsistencies. Measuring Data Quality How data quality plays out in an organization largely depends on how that organization defines it and prioritizes it. However, there are seven useful indicators of quality. : how relevant the data is for the business. Relevancy : how accurate the data is. Accuracy : whether the data is complete and in a stable state. Completeness : how consistent the data is across the organization. If the same piece of data from multiple sources is transformed using the same application, the output should always be the same. Consistency : whether the data adheres to the standards and format expected by business rules. Conformity : whether multiple copies of the same data are available in the enterprise or if it’s the single source of truth. Uniqueness : how timely the data is for current business needs. Timeliness Ensuring Data Quality Enterprises typically use automated tools for checking data quality. These tools can be either custom developed or vendor supplied. Both choices have their pros and cons. In-house solutions are feasible if there are IT resources available and data quality requirements are well defined. Roll-your-own solutions can also help cut down capital expenditures. On the other hand, building a custom solution can be time-consuming, and that lost time sometimes outweighs the benefits. Buying a solution is best if a company needs a reliable solution quickly and doesn’t want to maintain it themselves. Now that we’ve laid the groundwork, let’s look at our five data quality tools. Of course, this is not an exhaustive list, and there are many other tools in the market. The tools you choose to mix and match for your data quality needs will depend on your use case and budget. Great Expectations is an open-source library used for validating, documenting, and profiling data. Users define in the form of . An expectation is exactly what the name suggests—it’s the quality you are expecting from the data. Assertions are written in a declarative language. For example, here is a sample assertion where the column value has to be between 1 and 6. Great Expectations assertions expectations passenger_count Another feature is automated data profiling. Great Expectations can auto-generate expectations from the data based on its statistics. This can save time, as data quality engineers don’t have to write assertions from scratch. Once expectations are ready, they can be incorporated into the data pipeline. For example, in Apache Airflow, the data validation step can be defined as a using the . This will trigger the quality check as data flows through the pipeline. Great Expectations is compatible with most data sources such as CSV files, SQL databases, Spark DataFrames, and Pandas. checkpoint script BashOperator Spectacles is a continuous integration (CI) tool designed to validate your project’s LookML. is Looker’s data modeling language. If you aren’t familiar with , it’s a BI platform that allows people who don’t “speak SQL” to analyze and visualize data. Spectacles LookML Looker Spectacles validates LookML by running the SQL query behind it and checking for errors. It integrates with GitHub, GitLab, and Azure DevOps. The tool fits nearly any deployment pattern—whether called manually, triggered from a pull request, or run as part of an ETL job. Having Spectacles as part of a CI/CD workflow means LookML queries are automatically validated before the code is deployed into production. Datafold is a proactive data quality platform that has three main components: Data Diff, Data Catalog with column-level lineage, and Data Monitoring. Datafold Data Diff allows you to compare two datasets (for example, dev and prod) before merging them into production. This helps users to adopt a more proactive development strategy. Data Diff can also be integrated into a team’s CI/CD pipeline, such that diffs can show up alongside code changes in GitHub or GitLab. As an example, let’s look at the dataset that comes with Datafold’s sandbox environment. As you can see in the image below, we have run a Data Diff operation between and datasets. taxi_trips datadiff-demo.public.taxi_trips datadiff-demo.dev.taxi_trips On the detailed panel on the right, you can select different tabs to get different perspectives on the results. The tab will contain a rundown of the successful and failed tests. Overview The section shows whether the columns from both datasets (data type, order of appearance) match. Schema section shows what percentage of primary keys are unique, not NULL, and matching between both datasets. The Primary Keys Although the tab is a great source of information, other tabs provide more useful details. For example, the tab can look like this: Overview Schemas Here, the highlighted column is where the two datasets differ. This saves time because data engineers can concentrate on the content of those two fields. The lists all the data sources registered with Datafold. It also allows users to find and profile any particular dataset using filters. This can be a huge timesaver for organizations with hundreds—or even thousands—of datasets. It is an especially useful way to uncover anomalies. Particularly, the data lineage feature can help answer questions like the following: Data Catalog Where is this value coming from? How is this value affecting other tables? How are these tables related? The Data Catalog presents a dashboard like the one shown below. As you scroll down, you will see a list of each column. The columns provide important details at a glance: refers to what percentage of the values are not NULL. Completeness shows which values appear the most and the least. It will also show if the values are skewed towards a certain range. Distribution Clicking the icon under will show something like this: Lineage The graphical lineage diagram helps data engineers quickly find where a column’s values come from. Lineage can be checked for tables, for all or specific columns, and for upstream or downstream relationships. Datafold’s monitoring feature allows data engineers to write SQL commands to find anomalies and create automated alerts. Alerting is backed by machine learning, which studies the trends and seasonalities in the data to accurately spot anomalies. The images below show such a query: The query in this example is auto-generated by Datafold. It keeps track of the daily total fare and the total number of trips in the taxi dataset. Datafold also allows users to examine the trend of anomalies over time as shown in the image below: Here, the yellow dots represent anomalies with respect to the minimum and maximum values. dbt is a data transformation workflow tool. It runs the data transformation code against the target database deployment, showing how the code will affect the data and highlighting any potential problems. dbt does this by running SELECT statements to build the end state of the data based on the transformation logic. dbt before dbt is easy to integrate into the modern BI stack and can be a valuable part of a CI/CD pipeline. It can be run automatically upon a pull request or on a schedule. It has an automatic rollback feature that stops potentially code-breaking changes from getting deployed. Datafold and dbt can be used together for automating data quality testing. Just like dbt, Datafold can be integrated into the CI/CD pipeline. When used together, they show how your code affects your data. Evidently is an open-source Python library that analyzes and monitors machine learning models. It generates interactive reports based on Panda DataFrames and CSV files for troubleshooting models and checking data integrity. These reports show model health, data drift, target drift, data integrity, feature analysis, and performance by segment. Evidently To see how Evidently can help, you can open up a new notebook on Google Colab and copy the following code snippet: !pip install evidently

import pandas as pd
import math
from sklearn import datasets
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab

wine = datasets.load_wine()
wine_frame = pd.DataFrame(wine.data, columns = wine.feature_names)

number_of_rows = len(wine_frame)

wine_data_drift_report = Dashboard(tabs=[DataDriftTab])
wine_data_drift_report.calculate(wine_frame[:math.floor(number_of_rows/2)], wine_frame[math.floor(number_of_rows/2):], column_mapping = None)

wine_data_drift_report.save("report_1.html") This code will generate and load a report in the browser. The report’s dashboard overview will show the distribution of references and current values based on every feature. More close-up inspection is possible from here. For example, the graph below shows the exact values where current and reference datasets differ. Evidently has nifty features like numerical target drift, categorical target drift, regression model performance, classification model performance, and probabilistic classification model performance. Final Words Data quality standards and business requirements are constantly evolving, making the task of ensuring data quality a continuous journey. The introduction of bad data into your data-driven decision-making will have many consequences—all of which are detrimental to the business. That’s why it’s necessary to maintain standards of data quality at every step in the data journey. The five tools discussed here can be used during different phases of data processing and usage. In your own organization’s data journey, it’s worth testing out some of these tools individually—or in different combinations—to see how they can benefit your use cases.

The Graph

Alongside

Apache

Google

Target

Timely

Yellow

Migrating to Snowflake, Redshift, or BigQuery? Avoid these Common Pitfalls

Build a Lightweight REST API with Salesforce and Postman

I write about technology michael.bogan@gmail.com

Nominated for 2022 - HackerNoon Contributor of the Year - Software Architecture

Nominated for 2022 - HackerNoon Contributor of the Year - Microservices

Nominated for 2022 - HackerNoon Contributor of the Year - Devops

Nominated for 2022 - HackerNoon Contributor of the Year - Api

Nominated for 2022 - Software Developer of the Year

Nominated for 2022 - HackerNoon Contributor of the Year - Software

Too Long; Didn't Read

Five Data Quality Tools You Should Know

Five Data Quality Tools You Should Know

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

10 Threats to an Open API Ecosystem

Employees Are Leaving Companies Because of Bad Data

Handling Data Integrity Issues Like a Pro

When Big Data Goes Bad: Rehabilitating Data Quality

The Hidden Cost of Bad Data: Why It’s Undermining Your AI Strategy

Architecting Trustworthy Healthcare Data Platforms Using Declarative Pipelines

10 Threats to an Open API Ecosystem

Employees Are Leaving Companies Because of Bad Data

Handling Data Integrity Issues Like a Pro

When Big Data Goes Bad: Rehabilitating Data Quality

The Hidden Cost of Bad Data: Why It’s Undermining Your AI Strategy

Architecting Trustworthy Healthcare Data Platforms Using Declarative Pipelines

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps