How To Improve Data Quality for More Efficient Business Operations

Written by yuridanilov | Published 2021/10/25
Tech Story Tags: data-quality | sql | how-to-improve-data-quality | how-to-fix-poor-quality-data | why-is-data-quality-an-issue | data-inconsistency-examples | software-development | business-data

TLDR: An incorrect SSN or passport number will have a more negative impact than a typo in the text description of a product. For Mission Critical and Business Critical systems, an error caused by incorrect data is more serious than for Business Operational or Office Productivity systems. The degree to which data meets certain requirements is defined by the term 'Data Quality' (DQ). Problems causing DQ degradation may include missing values, duplicates, inconsistencies, anomalous values, data integrity violations, and incorrect formats.

Data is information such as facts or statistics stored in a form that is efficient for movement, processing, or communication. Typical examples of data are prices, costs, employee names, product names, addresses, tax codes, etc. Every information system contains data that it processes.

Data that does not match reality can lead to various errors in business processes and affect the quality of the services provided. The degree of this influence depends on various factors:

  • Data type: an incorrect SSN or passport number, for example, will have a more negative impact than a typo in the text description of a product
  • Amount of data: as an information system operates, the amount of data it contains tends to grow, and with it the likelihood of incorrect data appearing
  • Industry: there are fields in which special attention must be paid to data correctness, for example, financial institutions or the space industry
  • Level of criticality of the system: for Mission Critical and Business Critical systems, an error caused by incorrect data is more serious than for Business Operational or Office Productivity systems
  • Degree of integration with other systems: the more integrations a system has, the higher the chance of an error when receiving data from another system

The degree to which data meets certain requirements is defined by the term 'Data Quality' (DQ). Problems causing DQ degradation may include missing values, duplicates, inconsistencies, anomalous values, data integrity violations, and incorrect formats.

Various algorithms can be used to calculate DQ. As a rule, the analysis uses data from a mart or replica in the data warehouse that contains a snapshot of the data at a specific point in time. For small systems, data from backup copies of the production database can sometimes be used, or, very rarely, the production database itself, when there is no way to deploy another source.

In a simplified form, a validation algorithm can be described as a process that analyzes data according to a specific condition.

This is easier to understand with a simple example. Let's say we have a table of 100 records with customer information, containing for each customer a first name, last name, date of birth, SSN, email, and phone.
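The table itself is not reproduced here, so below is a minimal sketch of what its schema might look like (the column types are assumptions):

-- Hypothetical schema for the example customer table
CREATE TABLE test.clients (
    first_name varchar(100),
    last_name  varchar(100),
    birthday   date,
    ssn        varchar(9),
    email      varchar(255),
    phone      varchar(20)
);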

Obviously, each client must have a date of birth, while, theoretically, a person may not have an email or phone number. To check that the date of birth in our table is not empty, we check each record for the NOT NULL condition in the birthday field, running a query that returns the number of records with an empty date of birth. Let's say there are 20 such records out of 100. Then we can say that 20% of our records are incorrect in terms of the date of birth, so the quality of the data is 80%. How good or bad is that? It all depends on the system and on the data. In some cases, even one empty entry may not be acceptable.
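A minimal pair of queries for this check might look like this (using the test.clients sketch above):

-- Count records with an empty date of birth
SELECT count(*) FROM test.clients WHERE birthday IS NULL;

-- Or compute the completeness percentage directly
SELECT count(birthday) * 100.0 / count(*) AS birthday_completeness
FROM test.clients;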

When we say that a field must not be empty, we are assessing the quality of the data against the criterion of 'completeness'. Other common criteria (dimensions) for assessing DQ include consistency, reliability, accuracy, unambiguity, uniqueness, and compliance; in practice there may be others.

The term 'data quality requirement' or simply 'requirement' is used to describe the conditions for checking DQ. For example, the requirement can be formulated as follows: "SSN must be specified, as well as at least one of the communication methods." Then the condition for selecting incorrect records in SQL will look like this:

ssn is null or (email is null and phone is null)
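Wrapped in a complete query counting the incorrect records (a sketch against the hypothetical schema above):

-- Count records that violate the requirement:
-- SSN is missing, or both communication methods are missing
SELECT count(*)
FROM test.clients
WHERE ssn IS NULL OR (email IS NULL AND phone IS NULL);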

Consider another version of the requirement: "Date of birth and SSN must be filled in, fall within valid ranges, and match the expected format." Then the conditions for incorrect entries can be:

  • Date of birth is empty
  • Date of birth falls outside the valid range (for example, 01-01-9999)
  • SSN is empty
  • SSN contains non-numeric characters

If any of these conditions are met, the record will be considered incorrect. Obviously, the check will consist of several conditions connected by the 'OR' operator:

birthday is null or birthday > now() or ssn is null or not is_numeric(ssn)
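Note that is_numeric() is not built into most databases. In PostgreSQL, for example, the same check could be written with a regular expression (a sketch; assumes the SSN is stored as text):

-- PostgreSQL variant: ssn !~ '^[0-9]+$' is true when ssn contains non-digits
SELECT count(*)
FROM test.clients
WHERE birthday IS NULL
   OR birthday > now()
   OR ssn IS NULL
   OR ssn !~ '^[0-9]+$';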

All conditions are described in the requirement and then expressed in SQL (or other code) to perform validation on the dataset. Depending on the needs of the business, there may be several requirements, each addressing a DQ problem in a specific context. One requirement may ensure that the system can operate, requiring mandatory fields to be filled in; another may target the completeness of customer contact information and show, for example, how willing customers who registered online are to provide information about themselves.

An important point is that sometimes we do not need to check all the data, only the part of interest to us. For example, to check the completeness of contact details for those customers who registered online, we need some attribute by which to filter the data being checked. Suppose we have one more field, the registration channel; let's call it 'channel'. This field will contain the value 'website' if the client registered online, and 'office' if the client was registered by a company employee in the office.
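In SQL terms, adding such a column to the sketched schema could be as simple as:

-- Add the registration channel to the hypothetical test.clients table
ALTER TABLE test.clients ADD COLUMN channel varchar(10);
-- Expected values: 'website' (registered online) or 'office' (registered by an employee)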

Now, to check what percentage of customers specified their email when registering on the site, we select the data according to the condition:

where channel = 'website'

Using SQL, the resulting query might look like this:

SELECT count(email) * 100.0 / count(*) FROM test.clients WHERE channel = 'website';

In our test set, there are 61 records with the website channel, of which 15 have an empty email field, giving a data quality of about 75%.

In the examples above, we looked at the process of evaluating DQ on a single table. Other options are possible, for example, running checks against a prepared dataset. A dataset can be the result of an SQL query joining several tables, views, functions, etc.
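For example, a dataset could join the clients table to a hypothetical test.orders table (both the table and its client_ssn column are assumptions) to count orders that reference a non-existent client:

-- Integrity check on a derived dataset: orders without a matching client
SELECT count(*)
FROM test.orders o
LEFT JOIN test.clients c ON c.ssn = o.client_ssn
WHERE c.ssn IS NULL;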

Specific values for assessing DQ are usually called 'metrics'. Metrics can be complex. For example, we may want to evaluate the completeness of contact information for the entire table, knowing that customers registering on the site are required to specify their email address, while those who visited the office were asked to provide their phone number. Then, to the above query checking email completeness, we need to add a query checking phone completeness for users who visited the office:

SELECT count(phone) * 100.0 / count(*) FROM test.clients WHERE channel = 'office';

In our test set, this query returns a data quality of 58%.

At the same time, we can evaluate these two indicators both separately (75% and 58%) and together, obtaining a complex assessment by taking, for example, the average of the two: (75 + 58) / 2 = 66.5%. So 66.5% is the value of a complex quality indicator for customer contact information. Besides percentages, indicators can be expressed as proportions or absolute counts, calculated by a formula defined individually for each indicator.
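Both metrics and their average can also be computed in a single query (a sketch):

-- Complex indicator: average of the two channel-specific completeness metrics
SELECT (
    (SELECT count(email) * 100.0 / count(*)
       FROM test.clients WHERE channel = 'website')
  + (SELECT count(phone) * 100.0 / count(*)
       FROM test.clients WHERE channel = 'office')
) / 2 AS contact_info_dq;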

Improving data quality reduces the number of possible errors in business processes and leads to more efficient operation of the entire organization.

Based on various factors, thresholds can be set for indicators to show whether their values are within acceptable limits. For example, we can set a threshold of 95% and, if an indicator falls below it, inform the responsible users about low DQ, for example, by email.

The gradation of threshold values can differ. You can simply set a limit below which DQ must not fall. You can also define several intervals, each corresponding to a so-called quality category, and set a reaction for each category. For example:

  • From 90% to 100% - do nothing
  • From 70% to 90% - send an email to the employee responsible for DQ
  • From 50% to 70% - send an email and register an incident with a low priority in the enterprise management system
  • Below 50% - register a high-priority incident and send an email and SMS to the employee and their manager
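For illustration, the mapping from an indicator value to a reaction can be expressed as a simple CASE (the thresholds mirror the list above; 66.5 is the complex indicator computed earlier):

-- Map an indicator value to a quality category
SELECT CASE
         WHEN metric >= 90 THEN 'do nothing'
         WHEN metric >= 70 THEN 'email the responsible employee'
         WHEN metric >= 50 THEN 'email + low-priority incident'
         ELSE 'high-priority incident + email + sms'
       END AS reaction
FROM (SELECT 66.5 AS metric) AS t;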

The sets of commands that perform DQ assessments and produce indicator values are called 'checks'. Once developed, the same check can be run repeatedly to obtain indicator values at different times. Comparing indicators obtained at different times makes it possible to assess the dynamics of data quality and plan activities to improve DQ.
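To support such comparisons, the indicator values produced by each run of a check can be stored, for example, in a dedicated table (a hypothetical schema):

-- Store each run of a check so DQ dynamics can be compared over time
CREATE TABLE test.dq_check_results (
    check_name varchar(100),
    run_at     timestamp,
    metric_pct numeric(5,2)
);

-- Example: record the complex contact-information indicator
INSERT INTO test.dq_check_results VALUES ('contact_info_dq', now(), 66.50);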

To ensure that data quality improves, the users responsible for the data must be notified. Those users, in turn, must take steps to correct errors in the data; in other words, to bring the data up to date.

Wherever data is changed or added, the DQ assessment process should be iterative and continuous. In a simplified form, it can be described by the following steps:

  1. Perform a quality check

  2. Calculate and assess metrics

  3. Inform the responsible users about low DQ metrics

  4. Users correct errors in data

  5. Return to step 1

In addition to updating the data itself, there can also be measures aimed at preventing the appearance of incorrect data, for example:

  • Validating form input when users register on the site

  • Validating data entered in the office employee's application

  • Creating constraints in the database (see the sketch after this list)

  • Verification of data coming from other information systems

  • Etc.
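For example, database constraints for the sketched test.clients table might look like this (PostgreSQL syntax; the exact rules depend on the business requirements):

-- Reject records with an empty date of birth at the database level
ALTER TABLE test.clients ALTER COLUMN birthday SET NOT NULL;

-- Reject SSNs that contain anything other than nine digits
ALTER TABLE test.clients
    ADD CONSTRAINT chk_ssn_digits CHECK (ssn IS NULL OR ssn ~ '^[0-9]{9}$');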

Of course, this article gives a very simplified explanation of data quality, and in most cases you do not need to write all the checking code yourself, for example, in SQL. There are many DQ tools, and each company can choose one based on its own needs and capabilities. But regardless of the tool chosen, what matters most is implementing the DQ control process itself. In addition, the ISO 8000 series of international DQ standards is designed to provide consistently interpreted DQ concepts and related terms. Building on these standards, each company can develop its own approach to DQ management, and an ongoing process of assessing DQ and taking measures to improve it can have a positive effect on the efficiency of information systems and, ultimately, increase the company's profits.


Written by yuridanilov | Software Developer
Published by HackerNoon on 2021/10/25