How to Improve Data Quality in 2022 by @thuwarakesh

Thuwarakesh Murallie

Data Scientist @ Stax, Inc.

Data quality assessment is the continuous scientific process of evaluating if your data meets the standards. These standards may be tied to your business or the project goals.

The need to ensure data quality has grown as the ways to acquire data have multiplied.

Handling even a single data source can be challenging. Take a customer survey, for example. It's often difficult to normalize each respondent's information, even with online survey tools. Now imagine integrating and standardizing data from ERPs, CRMs, HR systems, and the many different sensors we use these days. Without data quality assessments, these problems never go away.

But there is good news! Our practices have evolved along with the complexities surrounding data acquisition and management.

Data quality assessments play a crucial role in data governance. They help us identify data issues at various levels of a data pipeline. They also help us quantify the business impact and take corrective measures as soon as possible.

Poor quality data can have serious consequences.

Take, for instance, a data quality issue in the healthcare industry. Suppose the data entry person duplicated a patient's record; the patient would receive two doses of the drug instead of one. The consequences can be disastrous.

Quality issues such as the above can have terrible effects regardless of the industry. But duplication is only one kind of data quality issue. There is a spectrum of other quality problems we need to worry about.

Let's imagine you're working on an inventory optimization problem. The stock is monitored through an automated system. What happens if one of your sensors sends values at twice the original frequency? Unreliable data will lead you to stock up on items already in the warehouse while missing out on all the high-demand stuff.

See your data acquisition and management processes from different angles.

Data quality has six dimensions: Accuracy, Completeness, Consistency, Timeliness, Validity, and Uniqueness. We can also think of data accountability and orderliness as other critical characteristics.

Data scientists and executives should think about these six data quality dimensions when creating new strategies.

The different dimensions discussed here are the scales against which we evaluate our data quality. Maintaining 100% quality in a vast data lake is nearly impossible. Data quality tolerance is a strategic decision we must make as early as possible. But that's for a future post.

Data accuracy.

Data accuracy is a prevalent quality aspect everyone is battling to get right. But what does data accuracy mean anyway?

Data accuracy is the extent to which the data at hand captures reality. The most apparent cause of inaccuracy is data entry: typos in names and wrong values for age.

But there are more disastrous issues.

NASA once lost a $125 million spacecraft. Lockheed Martin's engineering team, which was working with NASA to run the program, used English (imperial) units, while NASA's team used metric units. The mismatch between the two groups led to the loss of communication with the spacecraft.

Mismatched measurement units are a common cause of data inaccuracy.
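One way to guard against unit mismatches is to store the unit next to every value and convert everything to a single canonical unit at ingestion. The sketch below is only an illustration: the conversion table, the function name, and the sample readings are hypothetical and not taken from any real system.

```python
# Hypothetical conversion table: factor that converts each unit to newton-seconds.
TO_NEWTON_SECONDS = {
    "N*s": 1.0,
    "lbf*s": 4.4482216152605,  # 1 pound-force = 4.4482216152605 newtons
}

def to_newton_seconds(value, unit):
    """Convert a reading to newton-seconds, rejecting units we don't know."""
    if unit not in TO_NEWTON_SECONDS:
        raise ValueError(f"Unknown unit: {unit!r}")
    return value * TO_NEWTON_SECONDS[unit]

# Each reading carries its unit, so nothing is interpreted by guesswork.
readings = [(10.0, "N*s"), (10.0, "lbf*s")]
normalized = [to_newton_seconds(value, unit) for value, unit in readings]
```

Rejecting unknown units outright, rather than assuming a default, is a deliberate choice here: a loud failure at ingestion is cheaper than a silent inaccuracy downstream.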

Completeness of Data

Data completeness refers to your datasets having all the required information on every record. The requirements depend on the application and business needs. For instance, phone numbers are of little use to a machine learning model, whereas they're critical for a delivery system.

Form validations and database constraints help a lot in reducing completeness errors. Yet planning mistakes can still have a huge impact on data quality.

Data completeness is a tradeoff. The stricter you are about fields, the fewer records you get. This tradeoff holds for both manual and automatic data acquisition. If you make all the fields mandatory in a survey, you won't get as many responses as you intend. On the automated side, let's say you put a constraint on GPS coordinates for a data stream coming from a remote camera. If you then install a set of new devices that don't support GPS, the data they send won't be accepted into your data lake.

Completeness is a challenging dimension to score well on. The complexity grows as you acquire data from more sources.
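A completeness score could be computed as the share of records in which every required field is filled. The following is a minimal sketch with hypothetical field names and sample records; the point is that the same dataset scores differently depending on which fields the application requires.

```python
def completeness(records, required_fields):
    """Return the share of records where every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )
    return complete / len(records)

customers = [
    {"name": "Ann", "phone": "555-0100", "address": "1 Main St"},
    {"name": "Bob", "phone": "", "address": "2 Oak Ave"},
    {"name": "Cy", "phone": None, "address": None},
    {"name": "Di", "phone": "555-0101", "address": "4 Elm Rd"},
]

# The model only needs the name; the delivery system needs all three fields.
ml_score = completeness(customers, ["name"])
delivery_score = completeness(customers, ["name", "phone", "address"])
```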

Consistency in Data

Data consistency means having no contradictions in the data received from different sources. Because each data source may have its own way of measuring the information, the values sometimes don't match.

Say you want to find out the daily sales volume of a particular product. Your inventory management tracks sales based on the remaining items. Your POS tracks the same based on the items sold. Items returned may sneak into the inventory system without having a record in the POS.

At integration time, these two systems would report different numbers for daily sales volume.

In the ideal world, both systems should account for returns. But it's rarely the case, given the complexities of large-scale organizations.
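A simple consistency check could compare the two sources product by product and flag any disagreement. This is a sketch under assumed inputs: the dictionaries, product names, and the `tolerance` parameter are hypothetical.

```python
def find_inconsistencies(pos_sales, inventory_sales, tolerance=0):
    """Return {product: (pos, inventory)} for products where the sources disagree."""
    mismatches = {}
    for product in sorted(set(pos_sales) | set(inventory_sales)):
        pos = pos_sales.get(product, 0)
        inv = inventory_sales.get(product, 0)
        if abs(pos - inv) > tolerance:
            mismatches[product] = (pos, inv)
    return mismatches

# Daily units sold according to each system.
pos_sales = {"widget": 40, "gadget": 12}
inventory_sales = {"widget": 37, "gadget": 12}  # 3 returned widgets were restocked

mismatches = find_inconsistencies(pos_sales, inventory_sales)
```

The check only tells you that the numbers differ, not which system is right. Deciding that still requires knowing how each system treats returns.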

Timeliness in data quality

Data should be available at the time it's required in the system. Suppose you generate a report every Friday, and not all your data has arrived by then. The incomplete report can seriously skew your organization's decisions and direction.

Several reasons affect the timeliness of data:

  1. There are network issues. Read about edge computing if you think cities have decent internet connections and nothing to worry about. The whole concept is built to reduce network latency.
  2. There can be operational issues. The product returns and daily sales calculations are good examples of a lack of timeliness.
  3. We have problems arising at the point of data collection. They could be wrong data entry, malfunctioning sensors, etc.
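Whatever the cause, timeliness can be measured as the share of records that arrived before the moment they were needed. The sketch below assumes each record has an arrival timestamp and that the report runs at a fixed, hypothetical cutoff.

```python
from datetime import datetime

def timeliness(arrival_times, cutoff):
    """Return the share of records that arrived at or before the cutoff."""
    if not arrival_times:
        return 0.0
    on_time = sum(1 for t in arrival_times if t <= cutoff)
    return on_time / len(arrival_times)

cutoff = datetime(2022, 1, 7, 9, 0)  # hypothetical Friday 09:00 report run
arrivals = [
    datetime(2022, 1, 6, 23, 30),
    datetime(2022, 1, 7, 8, 59),
    datetime(2022, 1, 7, 9, 0),
    datetime(2022, 1, 7, 11, 15),  # arrived after the report was generated
]

score = timeliness(arrivals, cutoff)
```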

Invalid data

I finished high school about 12 years ago. But I still receive brochures from institutions that target school children. It's an excellent example of invalid data.

Invalid data are records that no longer have meaning. They take up space without being useful, and using them can be dangerous.

Invalid data costs a lot, yet the invalidation rules are blurry in some cases. For example, how do we know if a patient has fully recovered from a disease unless we are the doctor or the patient? Some diseases have an average time range for recovery, but not all. In such cases, you keep invalid data in your data store and make painful (sometimes harmful) decisions based on it.


Uniqueness in data

Uniqueness in data means the same information is not repeated twice or more. Duplication appears in two forms: duplicate records, and the same information stored in multiple places.

Duplicate records are often easy to spot. They appear more than once in the same dataset and are relatively straightforward to remove automatically.

A good practice is to impose the uniqueness constraint on a key column rather than on the whole record. That is because repeated entries may still differ in some fields. The timestamp on most transactional entries is a perfect example. Such records won't show up as duplicates unless we de-duplicate on one field or a combination of a few fields.
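The difference between whole-record and key-based de-duplication can be seen in a small sketch. The field names and the sample orders are hypothetical; the function keeps the first record seen for each key.

```python
def dedupe(records, key_fields):
    """Keep the first record for each combination of key fields."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

orders = [
    {"order_id": 1, "item": "cake", "logged_at": "2022-01-07T10:00:00"},
    {"order_id": 1, "item": "cake", "logged_at": "2022-01-07T10:00:03"},  # re-sent entry
    {"order_id": 2, "item": "pie", "logged_at": "2022-01-07T10:05:00"},
]

# Comparing whole records misses the duplicate because the timestamps differ.
by_whole_record = dedupe(orders, ["order_id", "item", "logged_at"])
# Comparing on the key column catches it.
by_key = dedupe(orders, ["order_id"])
```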

Information duplication is storing the same information in different places. For example, a patient's age may be on both the admissions table and the surgery table. It's just not a good design.

Duplicated information is the gateway to other quality issues. Failing to update every copy creates inconsistencies, and at least one of the copies will then be inaccurate.

Another, less apparent, kind of duplication is derived information. Take age and date of birth. Age can be computed from the date of birth, so storing both creates ambiguity whenever the two disagree.
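Instead of storing a derived value, it can be computed when needed. As a minimal sketch (the function name is hypothetical), age in completed years follows from the date of birth and a reference date:

```python
from datetime import date

def age_on(date_of_birth, today):
    """Return the age in completed years on the given day."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

age_before_birthday = age_on(date(1990, 6, 15), date(2022, 6, 14))
age_on_birthday = age_on(date(1990, 6, 15), date(2022, 6, 15))
```

With only the date of birth stored, the age can never go stale or contradict another table.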

How to perform a data quality assessment?

We need to perform data quality assessments for every critical area in our data store. The most granular level is the individual field, but you can also assess at coarser levels, up to the whole database.

These six steps help us do a continuous data quality assessment for an organization.

Data quality assessment is an iterative process to verify if your data meets the required standards. Each iteration will have the following six phases.

1. Define data quality targets.

In the 'define' phase, we translate the business goals into data quality targets and decide what quality is acceptable. These targets should be set against each of the six data quality dimensions.

Reaching 100% is unlikely in large-scale applications. But if you're working with smaller datasets, you can be more strict about them.

If you run a healthcare app that sends alerts for the next dose, you need to maintain a log of every dose the patient took. The timestamp field in every record is crucial for scheduling the next dose. Hence it should have a threshold of nearly 100% against all six dimensions.

But in case you own a cake shop and want to send a birthday card every year, your rules can be far more flexible.

The address or the phone number field should have a high threshold (say about 90%) for accuracy. Yet, they can have an average target (somewhere around 60%) for uniqueness because people sometimes give their alternative phone numbers when they buy.

These thresholds also depend on the domain. As seen in the last two examples, the cost of a mistake is minuscule in the second case compared to healthcare.

These rules can apply at multiple granularities. For instance, the address column can have a uniqueness threshold. But we can also impose a record-level completeness rule, such as requiring every record to have either a phone number or a mailing address.
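To make the two granularities concrete, here is a rough sketch with one column-level measure and one record-level rule. The function names, field names, and sample customers are hypothetical.

```python
def column_uniqueness(records, field):
    """Column-level measure: share of non-empty values in a column that are distinct."""
    values = [r[field] for r in records if r.get(field)]
    return len(set(values)) / len(values) if values else 0.0

def has_contact(record):
    """Record-level rule: a phone number or a mailing address must be present."""
    return bool(record.get("phone") or record.get("address"))

customers = [
    {"phone": "555-0100", "address": "1 Main St"},
    {"phone": "555-0100", "address": ""},   # shares a phone number with the first customer
    {"phone": "", "address": "3 Pine Ln"},
    {"phone": "", "address": ""},           # no way to reach this customer
]

phone_uniqueness = column_uniqueness(customers, "phone")
contact_completeness = sum(has_contact(r) for r in customers) / len(customers)
```

Each score can then be compared with whatever target was agreed for that field or rule.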

2. The data quality assessment

In the 'assessment' phase, we evaluate our datasets against the rules we defined for the six data quality dimensions. Each rule ends up with an acceptance score: the percentage of records that satisfy its conditions.
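An acceptance score as described above could be computed like this. The rules, targets, and patient records are made up for illustration, and a real assessment would cover many more rules.

```python
def acceptance_score(records, rule):
    """Return the percentage of records that satisfy the rule."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if rule(r)) / len(records)

# Hypothetical rules, each paired with its target (in percent).
rules = {
    "phone_present": (lambda r: bool(r.get("phone")), 90.0),
    "age_plausible": (lambda r: r.get("age") is not None and 0 <= r["age"] <= 120, 95.0),
}

patients = [
    {"phone": "555-0100", "age": 34},
    {"phone": "", "age": 51},
    {"phone": "555-0102", "age": 430},  # likely a typo
    {"phone": "555-0103", "age": 8},
]

# One entry per rule: the measured score next to the target it must reach.
report = {
    name: {"score": acceptance_score(patients, rule), "target": target}
    for name, (rule, target) in rules.items()
}
```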

On a small dataset or database, it's pretty easy to conduct these experiments manually. However, in a vast data warehouse, you need some automation to verify the data quality.

3. Analyse the assessment score.

Data quality assessment does not end with the assessment phase. A data quality assessment aims to identify the business impact as early as possible and implement corrective measures. Estimating the business impact is the goal of this phase.

It's a complex exercise, and it's not domain agnostic. The way one organization summarizes the assessment scores differs from another's.

But the goal of the phase is obvious. We find the most significant holes through which data quality leaks so that we can fix them.
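One possible way to find the biggest holes is to weight each shortfall against its target by an estimate of business impact. This is only one of many ways to summarize scores; the check names, scores, and weights below are assumptions for illustration.

```python
def rank_gaps(report, impact_weights):
    """Sort checks by (target - score) * impact weight, largest shortfall first."""
    gaps = []
    for name, result in report.items():
        shortfall = max(0.0, result["target"] - result["score"])
        gaps.append((name, shortfall * impact_weights.get(name, 1.0)))
    return sorted(gaps, key=lambda g: g[1], reverse=True)

# Hypothetical assessment results (scores and targets in percent).
report = {
    "dose_timestamp_complete": {"score": 97.0, "target": 100.0},
    "phone_unique": {"score": 50.0, "target": 60.0},
    "address_accurate": {"score": 92.0, "target": 90.0},
}
# Hypothetical weights: a missed dose alert costs far more than a misaddressed card.
impact_weights = {
    "dose_timestamp_complete": 10.0,
    "phone_unique": 1.0,
    "address_accurate": 2.0,
}

ranked = rank_gaps(report, impact_weights)
```

Note how a three-point shortfall on a high-impact field outranks a ten-point shortfall on a low-impact one.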

4. Brainstorm for improvements

In the 'brainstorm' phase, we collaboratively develop ideas that could fix the gaps we found. It is best to have a unit that includes members from every team so that the plans are

  • desirable by those who are responsible for it;
  • technically feasible, and;
  • economically viable.

Desirability, feasibility, and viability can make or break an idea.

Here's an example.

Let's suppose the analysis phase showed that patients don't fill in some critical information on the front desk survey. Your technical team may suggest making the survey electronic, since electronic surveys can impose validation checks that are hard to bypass. A doctor or a nurse in the group may say this solution is undesirable because people are in a rush to get treatment. The front desk staff may say patients are in a hurry even when checking out. Hence, your final solution may be a basic form at admission and a detailed electronic survey after treatment.

You may have noticed the solutions are not always technical. Data quality cannot be improved only by fixing the data pipeline. Sometimes it needs unconventional strategies that aren't apparent.

Yet, some of the most frequent top-of-the-mind answers could help as well. Here are a few.

  • Automate data collection.

A substantial number of quality issues can be resolved by this single trick. It improves quality against all six dimensions. If the system can do it, let it do it.

  • Create electronic forms whenever possible.

When automation is not possible, the next best solution is to create an electronic form with validations. It's harder to skip a critical question on an electronic survey than on a paper-based one. You also save a ton of time because you no longer have to digitize what you collect.
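Form validations of this kind might look like the sketch below. The field names, the allowed age range, and the deliberately loose phone pattern are all assumptions; real forms would use rules agreed with the business.

```python
import re

# Hypothetical, deliberately loose pattern: optional "+", then 7 to 15 digits,
# spaces, or hyphens, starting with a digit.
PHONE_PATTERN = re.compile(r"^\+?[0-9][0-9 \-]{6,14}$")

def validate_form(form):
    """Return a dict of field -> error message; an empty dict means the form is valid."""
    errors = {}
    if not form.get("name", "").strip():
        errors["name"] = "Name is required."
    if not PHONE_PATTERN.match(form.get("phone", "")):
        errors["phone"] = "Enter a valid phone number."
    age = form.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors["age"] = "Age must be a whole number between 0 and 120."
    return errors

good = validate_form({"name": "Ann", "phone": "+1 555-0100", "age": 34})
bad = validate_form({"name": " ", "phone": "abc", "age": 430})
```

Returning all errors at once, rather than stopping at the first, lets the form show the respondent everything that needs fixing in one pass.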

  • Create up-to-date metadata and share it with the relevant parties.

Metadata is the description of a dataset. It includes everything but the data itself, to help users understand why the dataset exists and how to use it. Commonly, metadata includes field types, field validations, and constraints.

Maintaining updated metadata helps speed up automation and communicate the requirements among teams without ambiguity.
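When metadata is kept in a machine-readable form, the same description that teams read can also drive automated checks. Here is a minimal sketch; the metadata layout and the field names are hypothetical rather than any standard format.

```python
# Hypothetical metadata for a patient admissions dataset.
metadata = {
    "patient_id": {"type": int, "required": True},
    "name": {"type": str, "required": True},
    "phone": {"type": str, "required": False},
}

def check_record(record, metadata):
    """Return a list of violations of the declared field types and constraints."""
    problems = []
    for field, spec in metadata.items():
        value = record.get(field)
        if value is None:
            if spec["required"]:
                problems.append(f"{field}: missing required field")
        elif not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
    return problems

ok = check_record({"patient_id": 7, "name": "Ann"}, metadata)
broken = check_record({"patient_id": "7"}, metadata)
```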

5. Implement the strategies to improve data quality.

Implementation takes more time and effort than any of the other five tasks in this assessment. But more time spent on the analysis and brainstorming phases could ease the work here.

We know that the strategies to improve data quality can be unconventional. Timing a survey well is as essential as imposing validations on an electronic form. Hence implementing these ideas takes commitment from multiple teams, in most cases all the way up to the C-suite.

6. Control

The final 'Control' stage is about the next iteration of the data quality assessment. How do we test the strategies we just implemented? Do the same metrics work, or do we need a different set of measures? Should we upgrade (or downgrade) the quality targets? What is a reasonable timeframe for the next iteration?

Final Thoughts

I discussed the basics of data quality dimensions and assessment in this post.

Data quality assessments are vital to maintaining a reliable data source, and they are crucial for data-driven decision-making. Depending on the domain and application type, the impact of poor quality data can be anywhere from insignificant to disastrous.

To understand data quality assessments, we must first understand the six dimensions of data quality. It's against these six dimensions we conduct the quality assessment.

The data quality assessment itself is a recurring process. Each iteration can have six stages to ensure the quality gaps are correctly identified and measures are taken to close them. The learnings from one iteration help adjust the targets for the next.

Originally published here.
