Data quality assessment is the continuous scientific process of evaluating whether your data meets the required standards. These standards may be tied to your business or the project goals.

The need for ensuring data quality has increased as the ways to acquire data have multiplied. Handling a single data source alone can be challenging at times. Take a customer survey, for example. It's often difficult to normalize each respondent's information, even with online survey tools. Now imagine integrating and standardizing data from ERPs, CRMs, HR systems, not to mention the many different sensors we use these days.

Without data quality assessments, these are a problem for a lifetime. But there is good news! We've evolved along with the complexities surrounding data acquisition and management.

Data quality assessments play a crucial role in data governance. They help us identify incorrect data at various levels in a data pipeline. They also help us quantify the business impact and take corrective measures as soon as possible.

Poor quality data can have serious consequences. Take, for instance, a data quality issue in the healthcare industry. Suppose the data entry person duplicated a patient's record; the patient could receive two doses of the drug instead of one. The consequences can be disastrous.

Quality issues such as the above can have terrible effects regardless of the industry. But duplication is only one kind of data quality issue. There is a spectrum of other quality problems we need to worry about.

Let's imagine you're working on an inventory optimization problem. The stock is monitored through an automated system. What happens if one of your sensors sends values at twice the original frequency? Unreliable data will lead you to stock up on items already in the warehouse while missing out on all the high-demand stuff.

To catch problems like these, look at your data acquisition and management processes from different angles.
Data quality has six dimensions: Accuracy, Completeness, Consistency, Timeliness, Validity, and Uniqueness. We can also think of data accountability and orderliness as other critical characteristics.

The dimensions discussed here are the scales against which we evaluate our data quality. Maintaining 100% quality in a vast data lake is nearly impossible. Data quality tolerance is a strategic decision we must make as early as possible. But that's for a future post.

Data accuracy

Data accuracy is a prevalent quality aspect everyone is battling to get right. But what does data accuracy mean anyway? Data accuracy is the extent to which the data at hand captures reality. The apparent cause of inaccuracy is data entry: typos in a name or wrong values for age.

But there are more disastrous issues. NASA once lost a $125 M spacecraft. Lockheed Martin, whose engineering team worked in English units, was working with NASA to run the program. The different measurement units used by the two groups caused the communication blackout with the spacecraft. Mismatched measurement units are a common cause of data inaccuracy.

Completeness of Data

Data completeness refers to your datasets having all the required information on every record. The requirements depend on the application and business needs. For instance, phone numbers are of little use to a machine learning model, whereas they're critical for a delivery system.

Form validations and database constraints help a lot in reducing completeness errors. Yet planning mistakes often have a huge impact on the quality of data.

Data completeness is a tradeoff. The stricter you are about the fields, the fewer records you get. This tradeoff holds for both manual and automatic data acquisition. If you make all the fields mandatory in a survey, you won't get as many responses as you intend. On the automated side, let's say you put a constraint on GPS coordinates for a data stream coming from a remote camera.
You install a set of new devices that may not support GPS, and the data they send won't get accepted into your data lake.

Completeness is a challenging dimension to score well on. The complexity grows as you acquire data from more sources.

Consistency in Data

Data consistency is having no contradiction in the data received from different sources. Because each data source may have a unique way of measuring the information, the sources sometimes don't match each other.

Say you want to find out the daily sales volume of a particular product. Your inventory management system tracks sales based on the remaining items. Your POS tracks the same based on the items sold. Returned items may sneak into the inventory system without having a record in the POS.

At integration time, these two systems would report different numbers for daily sales volume. In an ideal world, both systems should account for returns. But that's rarely the case, given the complexities of large-scale organizations.

Timeliness in data quality

Data should be available at the time it's required in the system. Suppose you generate a report every Friday, and not all your data has arrived yet; it'll seriously alter your organization's decisions and directions.

Several reasons affect the timeliness of data:

There are network issues. Read about edge computing if you think cities have decent internet connections and nothing to worry about. The whole concept is built to reduce network latency.

There can be operational issues. The product returns and daily sales calculations are good examples of a lack of timeliness.

There are problems arising at the point of data collection. They could be wrong data entry, malfunctioning sensors, etc.

Invalid data

I finished high school about 12 years ago. But I still receive brochures from institutions that target school children. It's an excellent example of invalid data.

Invalid data are records that don't have a meaning anymore. They fill up space with no use.
Using them can be dangerous, too. Invalid data costs a lot, yet the invalidation rules are blurry in some cases. For example, how do we know if a patient has fully recovered from a disease unless we're the doctor or the patient? Some diseases may have an average time range for recovery. But not all. In such cases, you keep invalid data in your data store and make painful (sometimes harmful) decisions based on them.

Uniqueness

Uniqueness in data means the same information isn't stored twice or more. Duplicates appear in two forms: duplicate records and information duplicated in multiple places.

Duplicate records are often easy to pick out. They appear more than once in the same dataset and are relatively straightforward to remove automatically. A good practice is to impose the uniqueness constraint on a key column rather than on the whole record. That is because repeated entries may contain fields that differ between copies. The timestamp on most transactional entries is a perfect example. Such entries won't show up as duplicates unless we use one field, or a combination of a few fields, for de-duplication.

Information duplication is storing the same information in different places. For example, a patient's age may be on the admissions table and the surgery table. It's just not a good design.

Duplicated information is the gateway to other quality issues. Failing to update all the records will create inconsistencies. And at least one of them is inaccurate anyway.

Another, not so apparent, duplication is derived information. Take age and date of birth. The date of birth is enough to find out the age. But storing both creates ambiguity.

How to perform a data quality assessment?

We need to perform data quality assessments for every critical area in our data store. The most granular level you can go to is the field level. But you can also check all the way up to the database level.

Data quality assessment is an iterative process to verify if your data meets the required standards.
Each iteration will have the following six phases.

1. Define data quality targets

In the 'define' phase, we translate the business goals into data quality targets and decide what quality is acceptable. These targets should be measured against each of the six data quality dimensions. Reaching 100% is unlikely in large-scale applications. But if you're working with smaller datasets, you can be stricter about them.

If you run a healthcare app that sends alerts for subsequent doses, you need to maintain a log of every dose the patient took. The timestamp field in every record is a crucial piece of information for the next dosage. Hence it should have a threshold of nearly 100% against all six dimensions.

But if you own a cake shop and want to send a birthday card every year, your rules can be far more flexible. The address or the phone number field should have a high threshold (say about 90%) for accuracy. Yet they can have an average target (somewhere around 60%) for uniqueness because people sometimes give alternative phone numbers when they buy.

These thresholds also depend on the domain. As seen in the last two examples, the cost of a mistake is minuscule in the second case compared to healthcare.

These rules can apply at multiple granularities. For instance, the address column can have its own uniqueness threshold. But we can also impose a record-level completeness rule: each record should have the phone number or the mailing address.

2. The data quality assessment

In the 'assessment' phase, we evaluate our datasets against the rules we defined on the six data quality dimensions. Each will end up with an acceptance score. The acceptance score is the percentage of records that satisfy the conditions.

On a small dataset or database, it's pretty easy to conduct these checks manually. However, in a vast data warehouse, you need some automation to verify the data quality.

3. Analyse the assessment score

Data quality assessment does not end with the assessment phase.
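The acceptance scores are what this analysis works with. As a rough sketch (not a definitive implementation), here is how the scores could be computed and compared against targets in Python. The record layout, the `phone` and `address` field names, the helper functions, and the target values are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable

Record = dict  # one row of the dataset, e.g. {"name": ..., "phone": ...}
Rule = Callable[[Record], bool]


def acceptance_score(records: list, rule: Rule) -> float:
    """Return the percentage of records that satisfy the rule."""
    if not records:
        return 0.0
    passed = sum(1 for record in records if rule(record))
    return 100.0 * passed / len(records)


def has_contact(record: Record) -> bool:
    """Completeness rule: the record has a phone number or a mailing address."""
    return bool(record.get("phone")) or bool(record.get("address"))


def unique_on(records: list, key: str) -> Rule:
    """Build a uniqueness rule: no other record shares this record's key value.

    Missing values are not counted as duplicates; completeness is scored
    separately.
    """
    counts = Counter(r[key] for r in records if r.get(key))
    return lambda record: counts.get(record.get(key), 0) <= 1


# Hypothetical customer records for the cake shop example.
customers = [
    {"name": "Ann", "phone": "555-0101", "address": "1 Baker St"},
    {"name": "Bob", "phone": "555-0102", "address": None},
    {"name": "Cara", "phone": "555-0101", "address": "3 Oak Ave"},
    {"name": "Dev", "phone": None, "address": None},
    {"name": "Eve", "phone": "555-0103", "address": "5 Elm St"},
]

scores = {
    "completeness": acceptance_score(customers, has_contact),
    "uniqueness": acceptance_score(customers, unique_on(customers, "phone")),
}

# Assumed targets from the 'define' phase; the numbers are illustrative only.
targets = {"completeness": 90.0, "uniqueness": 60.0}

# Dimensions whose score falls short of its target need attention.
gaps = [name for name, score in scores.items() if score < targets[name]]

print(scores)
print(gaps)
```

With these sample records, completeness comes to 80% and uniqueness to 60%, so only completeness misses its target. In a real warehouse the same idea would more likely run as queries or scheduled jobs than over in-memory lists.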
A data quality assessment aims to identify the business impact as early as possible and implement corrective measures. Estimating the business impact is the goal of this phase.

It's a complex exercise, and it's not domain agnostic. The way one organization summarizes the assessment scores differs from others. But the goal of the phase is clear: we find the most significant holes where data quality leaks so that we can fix them.

4. Brainstorm for improvements

In the 'brainstorm' phase, we collaboratively develop ideas that could fix the gaps we found. It is best to have a unit that includes members from every team so that the plans are desirable to those who are responsible for them, technically feasible, and economically viable. Desirability, feasibility, and viability can make or break an idea.

Here's an example. Let's suppose we identified in the analysis phase that patients don't fill in some critical information on the front desk survey. Your technical team may suggest making the survey electronic. Electronic surveys can impose validation checks that are hard to bypass. A doctor or a nurse in the group may say this solution is undesirable as people are in a rush to get treatment. The front desk staff may say patients are in a hurry even when checking out. Hence, your final solution may be a basic form at admission and a detailed electronic survey after treatment.

You may have noticed the solutions are not always technical. Data quality cannot be improved only by fixing the data pipeline. It may sometimes need unconventional strategies that aren't apparent. Yet some of the most frequent top-of-mind answers can help as well. Here are a few.

Automate data collection.

A substantial amount of quality issues can be resolved just by this single trick. It improves quality against all six dimensions. If the system can do it, let it do it.

Create electronic forms whenever possible.
When automation is not possible, the next best solution is to create an electronic form with validations. It's harder to skip a critical question on an electronic survey than on a paper-based one. And you also save a ton of time by not having to digitize your collections.

Create up-to-date metadata and share it with the relevant parties.

Metadata is the description of a dataset. It includes everything but the data itself to help users understand why the dataset exists. Commonly, metadata includes field types, field validations, and constraints. Maintaining updated metadata helps speed up automation and communicate the requirements among teams without ambiguity.

5. Implement the strategies to improve data quality

Implementation takes more time and effort than any of the other five phases in this assessment. But more time spent on the analysis and brainstorming phases could ease the work here.

We know that the strategies to improve data quality can be unconventional. Good timing for a survey is as essential as imposing validations on an electronic form. Hence the implementation of these ideas is a commitment from multiple teams. In most cases, it goes all the way up to the C suite.

6. Control

The final 'control' stage is about the next iteration of the data quality assessment. How do we test the strategies we just implemented? Do the same metrics work, or do we need a different set of measures? Should we upgrade (or downgrade) the quality targets? What is a reasonable timeframe for the next iteration?

Final Thoughts

I discussed the basics of data quality dimensions and assessment in this post.

Data quality assessments are vital to maintaining a reliable data source. And they are crucial for data-driven decision-making. Depending on the domain and application type, the impact of poor quality data can be anywhere from insignificant to disastrous.

To understand data quality assessments, we must first understand the six dimensions of data quality.
It's against these six dimensions that we conduct the quality assessment.

The data quality assessment itself is a recurring process. Each iteration has six stages to ensure the quality gaps are correctly identified and measures are taken to close them. The learnings of one iteration help adjust the targets for the next.

Originally published here.
