QA Checks for Big Datasets With Deequ & Statistical Methods

One of the vital skills of an accomplished data professional is the effective handling of large datasets, ensuring data quality and reliability. Data is the central and fundamental piece of any data system, and whatever good skills you have in other aspects of our trade, this is one you cannot afford to overlook. In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches I explain below, you will be able to maintain data integrity, enhance your data management practices, and prevent potential issues in downstream applications. QA Checks Using the Deequ Library Why Deequ? Ensuring data quality at scale is a daunting task, especially when dealing with billions of rows stored in distributed file systems or data warehouses. The Deequ library is an open-source data profiling and QA framework built on Spark that is a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its ability to integrate seamlessly with Spark, leveraging distributed processing power for efficient handling of large-scale datasets. When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Additionally, Deequ features extensive metrics and anomaly detection capabilities that will help you identify and proactively address data quality issues. For data professionals working with large and dynamic datasets, Deequ is a Swiss-knife solution. Let’s see how we can use it. Setting Up Deequ More details on the Deequ library setup and use cases around data profiling are accessible here. For the sake of simplicity, in this example, we just generated a few toy records: val rdd = spark.sparkContext.parallelize(Seq( Item(1, "Thingy A", "awesome thing.", "high", 0), Item(2, "Thingy B", "available at http://thingb.com", null, 0), Item(3, null, null, "low", 5), Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10), Item(5, "Thingy E", null, "high", 12))) val data = spark.createDataFrame(rdd) Defining Data Assumptions Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks: Row Count: Ensure the dataset contains a specific number of rows. Attribute Completeness: Check that attributes like id and productName are never NULL. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique. Value Range: Validate that attributes like priority and numViews fall within expected ranges. Pattern Matching: Verify that descriptions contain URLs when expected. Statistical Properties: Ensure the median of numerical attributes meets specific criteria. Here is how you can implement these checks using Deequ: import com.amazon.deequ.VerificationSuite import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus} val verificationResult = VerificationSuite() .onData(data) .addCheck( Check(CheckLevel.Error, "unit testing my data") .hasSize(_ == 5) // we expect 5 rows .isComplete("id") // should never be NULL .isUnique("id") // should not contain duplicates .isComplete("productName") // should never be NULL // should only contain the values "high" and "low" .isContainedIn("priority", Array("high", "low")) .isNonNegative("numViews") // should not contain negative values // at least half of the descriptions should contain a url .containsURL("description", _ >= 0.5) // half of the items should have less than 10 views .hasApproxQuantile("numViews", 0.5, _ <= 10)) .run() Interpreting Results After running these checks, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. Afterward, it invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see if the constraints hold on the data. We can inspect the object "verificationResult” to see if the test found errors: import com.amazon.deequ.constraints.ConstraintStatus if (verificationResult.status == CheckStatus.Success) { println("The data passed the test, everything is fine!") } else { println("We found errors in the data:\n") val resultsForAllConstraints = verificationResult.checkResults .flatMap { case (_, checkResult) => checkResult.constraintResults } resultsForAllConstraints .filter { _.status != ConstraintStatus.Success } .foreach { result => println(s"${result.constraint}: ${result.message.get}") } } If we run the example, we get the following output: We found errors in the data: CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement! PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement! The test found that our assumptions were violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (i.e., 40%) values of the description attribute did contain a URL. Fortunately, we ran a test and found the errors; somebody should immediately fix the data! QA Checks With Statistical Methods While Deequ offers a robust framework for data validation, integrating statistical methods can further enhance your QA checks, especially if you are dealing with aggregated metrics of a dataset. Let’s see how you can employ statistical methods to monitor and ensure data quality. Record Count Tracking Consider a business scenario where an ETL (Extract, Transform, Load) process produces N records on a daily scheduled job. Support teams may want to set up QA checks to raise an alert if there is a significant deviation in the record count. For instance, if the process typically generates between 9,500 to 10,500 records daily over two months, any significant increase or decrease could indicate an issue with the underlying data. We can use a statistical method to define this threshold on which process should raise an alert to the support team. Below is an illustration of record count tracking over two months: To analyze this, we can transform the record count data to observe the day-to-day changes. These changes generally oscillate around zero, as shown in the following chart: When we represent this rate of change with a normal distribution, it forms a bell curve, indicating that the data is distributed normally. The expected change is around 0%, with a standard deviation of 2.63%. This analysis suggests that the record count typically falls within the -5.26% to +5.25% range with 90% confidence. Based on this, you can establish a rule to raise an alert if the record count deviates beyond this range, ensuring timely intervention. Attribute Coverage Tracking Attribute coverage refers to the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%. Let’s review another business case with an ETL process generating a product table snapshot daily. We want to monitor the coverage of product description attributes. If the coverage falls below a certain threshold, an alert should be raised for the support team. Below is a visual representation of attribute coverage for product descriptions over two months: By analyzing the absolute day-to-day differences in coverage, we observe that the changes oscillate around zero: Representing this data as a normal distribution shows that it is normally distributed with an expected change of around 0% and a standard deviation of 2.45%. As we see, for this dataset, the product description attribute coverage typically ranges from -4.9% to +4.9% with 90% confidence. Based on this indicator, we can set a rule to raise an alert if the coverage deviates beyond this range. QA Checks With Time Series Algorithms If you work with datasets that show significant variations due to factors like seasonality or trends, traditional statistical methods might trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks. To produce more sensible alerts, you can use either the Autoregressive Integrated Moving Average (ARIMA) or the Holt-Winters Method. The former is good enough for datasets with trends, but the latter lets us deal with datasets with both trend and seasonality. This method uses components for level, trend, and seasonality, which allows it to adapt flexibly to changes over time. Let’s mock-model daily sales that exhibit both trend and seasonal patterns using Holt-Winters: import pandas as pd from statsmodels.tsa.holtwinters import ExponentialSmoothing # Load and preprocess the dataset data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True) data = data.asfreq('D').fillna(method='ffill') # Fit the Holt-Winters model model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=365) fit = model.fit() # Forecast and detect anomalies forecast = fit.fittedvalues residuals = data - forecast threshold = 3 * residuals.std() anomalies = residuals[abs(residuals) > threshold] print("Anomalies detected:") print(anomalies) Using this method, you can detect significant deviations that might indicate data quality issues, providing a more nuanced approach to QA checks. I hope this article will help you efficiently implement QA checks for your large datasets. By using the Deequ library and integrating statistical methods and time series algorithms, you can ensure data integrity and reliability, ultimately enhancing your data management practices. Implementing the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows. One of the vital skills of an accomplished data professional is the effective handling of large datasets, ensuring data quality and reliability. Data is the central and fundamental piece of any data system, and whatever good skills you have in other aspects of our trade, this is one you cannot afford to overlook. In this article, I explore robust techniques for performing QA checks on large datasets using the Deequ library and statistical methods. By combining the approaches I explain below, you will be able to maintain data integrity, enhance your data management practices, and prevent potential issues in downstream applications. QA Checks Using the Deequ Library QA Checks Using the Deequ Library Why Deequ? Why Deequ? Ensuring data quality at scale is a daunting task, especially when dealing with billions of rows stored in distributed file systems or data warehouses. The Deequ library is an open-source data profiling and QA framework built on Spark that is a modern and versatile tool designed to solve this problem. What sets it apart from similar tools is its ability to integrate seamlessly with Spark, leveraging distributed processing power for efficient handling of large-scale datasets. When you try it, you will see how its flexibility lets you define complex validation rules tailored to your specific requirements, ensuring comprehensive coverage. Additionally, Deequ features extensive metrics and anomaly detection capabilities that will help you identify and proactively address data quality issues. For data professionals working with large and dynamic datasets, Deequ is a Swiss-knife solution. Let’s see how we can use it. Setting Up Deequ Setting Up Deequ More details on the Deequ library setup and use cases around data profiling are accessible here . For the sake of simplicity, in this example, we just generated a few toy records: here val rdd = spark.sparkContext.parallelize(Seq( Item(1, "Thingy A", "awesome thing.", "high", 0), Item(2, "Thingy B", "available at http://thingb.com", null, 0), Item(3, null, null, "low", 5), Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10), Item(5, "Thingy E", null, "high", 12))) val data = spark.createDataFrame(rdd) val rdd = spark.sparkContext.parallelize(Seq( Item(1, "Thingy A", "awesome thing.", "high", 0), Item(2, "Thingy B", "available at http://thingb.com", null, 0), Item(3, null, null, "low", 5), Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10), Item(5, "Thingy E", null, "high", 12))) val data = spark.createDataFrame(rdd) Defining Data Assumptions Defining Data Assumptions Most data applications come with implicit assumptions about data attributes, such as non-NULL values and uniqueness. With Deequ, these assumptions become explicit through unit tests. Here are some common checks: Row Count: Ensure the dataset contains a specific number of rows. Attribute Completeness: Check that attributes like id and productName are never NULL. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique. Value Range: Validate that attributes like priority and numViews fall within expected ranges. Pattern Matching: Verify that descriptions contain URLs when expected. Statistical Properties: Ensure the median of numerical attributes meets specific criteria. Row Count: Ensure the dataset contains a specific number of rows. Row Count: Ensure the dataset contains a specific number of rows. Row Count: Attribute Completeness: Check that attributes like id and productName are never NULL. Attribute Completeness: Check that attributes like id and productName are never NULL. Attribute Completeness: Attribute Uniqueness: Ensure that certain attributes, such as id, are unique. Attribute Uniqueness: Ensure that certain attributes, such as id, are unique. Attribute Uniqueness: Value Range: Validate that attributes like priority and numViews fall within expected ranges. Value Range: Validate that attributes like priority and numViews fall within expected ranges. Value Range: Pattern Matching: Verify that descriptions contain URLs when expected. Pattern Matching: Verify that descriptions contain URLs when expected. Pattern Matching: Statistical Properties: Ensure the median of numerical attributes meets specific criteria. Statistical Properties: Ensure the median of numerical attributes meets specific criteria. Statistical Properties: Here is how you can implement these checks using Deequ: import com.amazon.deequ.VerificationSuite import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus} val verificationResult = VerificationSuite() .onData(data) .addCheck( Check(CheckLevel.Error, "unit testing my data") .hasSize(_ == 5) // we expect 5 rows .isComplete("id") // should never be NULL .isUnique("id") // should not contain duplicates .isComplete("productName") // should never be NULL // should only contain the values "high" and "low" .isContainedIn("priority", Array("high", "low")) .isNonNegative("numViews") // should not contain negative values // at least half of the descriptions should contain a url .containsURL("description", _ >= 0.5) // half of the items should have less than 10 views .hasApproxQuantile("numViews", 0.5, _ <= 10)) .run() import com.amazon.deequ.VerificationSuite import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus} val verificationResult = VerificationSuite() .onData(data) .addCheck( Check(CheckLevel.Error, "unit testing my data") .hasSize(_ == 5) // we expect 5 rows .isComplete("id") // should never be NULL .isUnique("id") // should not contain duplicates .isComplete("productName") // should never be NULL // should only contain the values "high" and "low" .isContainedIn("priority", Array("high", "low")) .isNonNegative("numViews") // should not contain negative values // at least half of the descriptions should contain a url .containsURL("description", _ >= 0.5) // half of the items should have less than 10 views .hasApproxQuantile("numViews", 0.5, _ <= 10)) .run() Interpreting Results Interpreting Results After running these checks, Deequ translates them into a series of Spark jobs, which it executes to compute metrics on the data. Afterward, it invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see if the constraints hold on the data. We can inspect the object "verificationResult” to see if the test found errors: import com.amazon.deequ.constraints.ConstraintStatus if (verificationResult.status == CheckStatus.Success) { println("The data passed the test, everything is fine!") } else { println("We found errors in the data:\n") val resultsForAllConstraints = verificationResult.checkResults .flatMap { case (_, checkResult) => checkResult.constraintResults } resultsForAllConstraints .filter { _.status != ConstraintStatus.Success } .foreach { result => println(s"${result.constraint}: ${result.message.get}") } } import com.amazon.deequ.constraints.ConstraintStatus if (verificationResult.status == CheckStatus.Success) { println("The data passed the test, everything is fine!") } else { println("We found errors in the data:\n") val resultsForAllConstraints = verificationResult.checkResults .flatMap { case (_, checkResult) => checkResult.constraintResults } resultsForAllConstraints .filter { _.status != ConstraintStatus.Success } .foreach { result => println(s"${result.constraint}: ${result.message.get}") } } If we run the example, we get the following output: We found errors in the data: CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement! PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement! We found errors in the data: CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement! PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement! The test found that our assumptions were violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (i.e., 40%) values of the description attribute did contain a URL. Fortunately, we ran a test and found the errors; somebody should immediately fix the data! QA Checks With Statistical Methods QA Checks With Statistical Methods While Deequ offers a robust framework for data validation, integrating statistical methods can further enhance your QA checks, especially if you are dealing with aggregated metrics of a dataset. Let’s see how you can employ statistical methods to monitor and ensure data quality. Record Count Tracking Record Count Tracking Consider a business scenario where an ETL (Extract, Transform, Load) process produces N records on a daily scheduled job. Support teams may want to set up QA checks to raise an alert if there is a significant deviation in the record count. For instance, if the process typically generates between 9,500 to 10,500 records daily over two months, any significant increase or decrease could indicate an issue with the underlying data. We can use a statistical method to define this threshold on which process should raise an alert to the support team. Below is an illustration of record count tracking over two months: To analyze this, we can transform the record count data to observe the day-to-day changes. These changes generally oscillate around zero, as shown in the following chart: When we represent this rate of change with a normal distribution, it forms a bell curve, indicating that the data is distributed normally. The expected change is around 0%, with a standard deviation of 2.63%. This analysis suggests that the record count typically falls within the -5.26% to +5.25% range with 90% confidence. Based on this, you can establish a rule to raise an alert if the record count deviates beyond this range, ensuring timely intervention. Attribute Coverage Tracking Attribute Coverage Tracking Attribute coverag e refers to the ratio of non-NULL values to the total record count for a dataset snapshot. For example, if 8 out of 100 records have a NULL value for a particular attribute, the coverage for that attribute is 92%. Attribute coverag Let’s review another business case with an ETL process generating a product table snapshot daily. We want to monitor the coverage of product description attributes. If the coverage falls below a certain threshold, an alert should be raised for the support team. Below is a visual representation of attribute coverage for product descriptions over two months: By analyzing the absolute day-to-day differences in coverage, we observe that the changes oscillate around zero: Representing this data as a normal distribution shows that it is normally distributed with an expected change of around 0% and a standard deviation of 2.45%. As we see, for this dataset, the product description attribute coverage typically ranges from -4.9% to +4.9% with 90% confidence. Based on this indicator, we can set a rule to raise an alert if the coverage deviates beyond this range. QA Checks With Time Series Algorithms QA Checks With Time Series Algorithms If you work with datasets that show significant variations due to factors like seasonality or trends, traditional statistical methods might trigger false alerts. Time series algorithms offer a more refined approach, improving the accuracy and reliability of your QA checks. To produce more sensible alerts, you can use either the Autoregressive Integrated Moving Average (ARIMA) or the Holt-Winters Method . The former is good enough for datasets with trends, but the latter lets us deal with datasets with both trend and seasonality. This method uses components for level, trend, and seasonality, which allows it to adapt flexibly to changes over time. Autoregressive Integrated Moving Average (ARIMA) Autoregressive Integrated Moving Average (ARIMA) Holt-Winters Method Holt-Winters Method Let’s mock-model daily sales that exhibit both trend and seasonal patterns using Holt-Winters: import pandas as pd from statsmodels.tsa.holtwinters import ExponentialSmoothing # Load and preprocess the dataset data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True) data = data.asfreq('D').fillna(method='ffill') # Fit the Holt-Winters model model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=365) fit = model.fit() # Forecast and detect anomalies forecast = fit.fittedvalues residuals = data - forecast threshold = 3 * residuals.std() anomalies = residuals[abs(residuals) > threshold] print("Anomalies detected:") print(anomalies) import pandas as pd from statsmodels.tsa.holtwinters import ExponentialSmoothing # Load and preprocess the dataset data = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True) data = data.asfreq('D').fillna(method='ffill') # Fit the Holt-Winters model model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=365) fit = model.fit() # Forecast and detect anomalies forecast = fit.fittedvalues residuals = data - forecast threshold = 3 * residuals.std() anomalies = residuals[abs(residuals) > threshold] print("Anomalies detected:") print(anomalies) Using this method, you can detect significant deviations that might indicate data quality issues, providing a more nuanced approach to QA checks. I hope this article will help you efficiently implement QA checks for your large datasets. By using the Deequ library and integrating statistical methods and time series algorithms, you can ensure data integrity and reliability, ultimately enhancing your data management practices. Implementing the techniques described above will help you prevent potential issues in downstream applications and improve the overall quality of your data workflows.