Poor Data Quality is the Bane of Machine Learning Models

by Steven Finkelstein, November 1st, 2022
Too Long; Didn't Read

Data quality is an underrated aspect of machine learning. Business leaders need to take it more seriously.

Whether it is Alexa answering a question or Elon preaching self-driving cars, business leaders see exciting machine learning use cases everywhere they look. The hype around artificial intelligence (AI) and machine learning (ML) is staggering. So much of the mainstream discussion centers on these exciting applications, but it often leaves out the inputs and processes required to build these systems.

Quick digression (Optional). Most media will incorrectly lump all AI applications together as the same type. However, there are two types of artificial intelligence: narrow and general. Narrow AI optimizes a specific task or objective, which covers 99.99% of use cases out there. Outside of this narrow objective, a human is smarter in every other situation. General AI aims at intelligence that is broader and more human-like (e.g. the Terminator). There is a HUGE difference between the AI used to forecast the weather and the AI needed to build the Terminator.

What is Required to Build a Machine Learning System?

The machine learning development lifecycle requires 4 key components:

  • Business Problem
  • Data
  • Machine Learning Algorithm
  • IT Infrastructure

Stakeholders requesting a machine learning system define the business problem that needs a solution (e.g. forecasting future sales). A tech professional will create a model by training a machine learning algorithm on a relevant dataset. The trained model will output a prediction based on the patterns found in the dataset. Underpinning this model is the IT infrastructure (e.g. AWS). The infrastructure is required for the storage, compute, and deployment involved in the development lifecycle. Data requires storage. Data prep and model training require compute (i.e. CPU/GPU). Model deployment and serving require a host (e.g. API endpoint in the cloud).

Full end-to-end ML development lifecycle; image by author

What is Machine Learning?

A simple example of machine learning would be predicting what I wear based on the weather. If you knew the temperature outside, the wind chill, the humidity, and what I wore each day over the past 2 years, then you could likely create a model. The weather indicators (e.g. temp, humidity, wind) would be used to predict what I wear each day. There are likely patterns that map the weather outside to what I chose to wear that day. Train the model on this data and voila: you have your predictive model. How could this possibly go wrong?

Note: The weather indicators are referred to as features, predictors, or input. “What I wear each day” is referred to as the target variable or output.
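To make the features-and-target idea concrete, here is a minimal sketch in Python. It is not the method behind any particular production system: it uses a simple nearest-neighbor lookup, and every record, value, and name below (training_data, predict_outfit) is made up for illustration.

```python
import math

# Hypothetical history: (temperature in °F, humidity in %, wind in mph) -> outfit.
# The tuples are the features; the strings are the target variable.
training_data = [
    ((25.0, 40.0, 15.0), "coat and jeans"),
    ((35.0, 55.0, 10.0), "coat and jeans"),
    ((60.0, 50.0, 8.0), "sweater and jeans"),
    ((65.0, 45.0, 5.0), "sweater and jeans"),
    ((85.0, 70.0, 3.0), "t-shirt and shorts"),
    ((90.0, 65.0, 4.0), "t-shirt and shorts"),
]


def predict_outfit(features, data=training_data):
    """Predict the outfit worn on the most similar past day (1-nearest neighbor)."""
    nearest = min(data, key=lambda row: math.dist(row[0], features))
    return nearest[1]


print(predict_outfit((30.0, 50.0, 12.0)))
print(predict_outfit((88.0, 68.0, 3.0)))
```

The sketch only works because the made-up history contains a clear, stable pattern between weather and clothing. The rest of this article is about what happens when it does not.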

Why do ML Systems Fail?

Many stakeholders have the misconception that machine learning is magic. All you need is some data fed into an algorithm and the result is a beautiful model. This couldn’t be further from the truth. Machine learning is a process that finds patterns in the data between the input (features/predictors) and output (target variable). Mathematical functions are used to map each type of input to the most common output in that scenario. The “magic” requires stable, unique patterns between various combinations of the input and output. If the patterns are unstable, or fail to provide enough signal, then the model will likely be useless.

The branch of data science that deals with this problem is called data quality. Data quality refers to the set of activities that apply quality management techniques to data in order to ensure it is fit to serve the needs of an organization in a specific context or problem. Poor data quality is the root cause of many machine learning system failures. The primary dimensions for measuring data quality are the following:

  • Accuracy: The data reflects the real-world objects and/or events it is intended to model

  • Completeness: The data makes all required records and values available

  • Consistency: Data values drawn from multiple locations do not conflict

  • Validity: The data conforms to defined business rules and falls within allowable parameters when those rules are applied

  • Uniqueness: No record exists more than once in the dataset, even if it exists in multiple locations. Every record can be uniquely identified and accessed within the dataset and across applications.

  • Timeliness: Data is updated as frequently as necessary to ensure it meets user requirements for accuracy, accessibility, and availability

While Informatica specifies these 6 dimensions and their respective definitions, I would add the following:

  • Diversity of information to encourage uniqueness of records

  • Representative sample of data from the population to reduce biases in how data was captured and what data was captured

  • Clean data, in which each value that is incorrect, corrupted, incorrectly formatted, duplicate, or incomplete is fixed or removed

Note: If you want to read about other dimensions, I’d check out this MIT paper from Aug, 1991.
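A few of the dimensions above (completeness, validity, uniqueness) can be checked mechanically. The following is a rough sketch, not a standard tool: the records, the field names, and the 0 to 100 rule for humidity are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"date": date(2022, 1, 3), "temp_f": 22.0, "humidity_pct": 5.0, "outfit": "t-shirt and shorts"},
    {"date": date(2022, 1, 3), "temp_f": 22.0, "humidity_pct": 5.0, "outfit": "t-shirt and shorts"},
    {"date": date(2022, 3, 14), "temp_f": None, "humidity_pct": 48.0, "outfit": "sweater and jeans"},
    {"date": date(2022, 6, 21), "temp_f": 84.0, "humidity_pct": 140.0, "outfit": "t-shirt and shorts"},
]

REQUIRED = ("date", "temp_f", "humidity_pct", "outfit")


def incomplete(rows):
    """Completeness: indexes of rows with a missing required value."""
    return [i for i, r in enumerate(rows) if any(r.get(k) is None for k in REQUIRED)]


def invalid(rows):
    """Validity: indexes of rows whose humidity falls outside the assumed 0-100 range."""
    return [
        i for i, r in enumerate(rows)
        if r.get("humidity_pct") is not None and not 0 <= r["humidity_pct"] <= 100
    ]


def duplicates(rows):
    """Uniqueness: indexes of rows that repeat an earlier row exactly."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        key = tuple(r.get(k) for k in REQUIRED)
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes


print(incomplete(records), invalid(records), duplicates(records))
```

Accuracy, consistency, and timeliness are harder to automate because they require comparing the data against the real world or against other systems.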

How Does Poor Data Quality Cause Failure?

The weather model would be useless, or worse*, if the data does not meet a high standard on each of the dimensions listed above. Using the table below, we can zoom into the data to see how poor data quality could present itself. Assume that this information represents one adult, the weather where they live and work, and the clothes they wore that day.

table by author

Data Quality Issue #1: Unclean Data

Data scientists and data engineers spend the majority of their time cleaning data. The lack of standardized formatting in the “Humidity” column is problematic. Without standardized formatting, data is unclear. For example, is the “.05” humidity from January 3rd the actual humidity, a typo, or did the user intend to convert .05 to 5 percent? Best practice would suggest only storing numeric data in this column and specifying the “percentage” label in the column header or another document. If I received this dataset, I would remove all “%” signs, any “perc” or “percent” text, remove any whitespace, and either convert the Jan 3 humidity value from .05 to 5 or remove it completely.
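The cleanup steps described above could look roughly like this in Python. The function name clean_humidity and the sample inputs are hypothetical, and treating any value between 0 and 1 as a fraction is one of the two options mentioned above, not a rule that is safe for every dataset.

```python
from typing import Optional


def clean_humidity(raw: str) -> Optional[float]:
    """Return humidity as a percentage (0-100), or None if the value is unusable."""
    text = raw.lower()
    # Remove unit labels; "percent" must be stripped before its prefix "perc".
    for label in ("percent", "perc", "%"):
        text = text.replace(label, "")
    # Remove all whitespace, including any left between digits and labels.
    text = "".join(text.split())
    try:
        value = float(text)
    except ValueError:
        return None  # not numeric: drop it rather than guess
    # Assumption: a value strictly between 0 and 1 was entered as a fraction.
    if 0 < value < 1:
        value = round(value * 100, 6)
    return value if 0 <= value <= 100 else None


# Hypothetical raw values in the spirit of the "Humidity" column.
for raw in ["45%", " 60 percent ", "38perc", ".05", "n/a"]:
    print(repr(raw), "->", clean_humidity(raw))
```

Returning None for unusable values keeps the decision to remove a record explicit instead of silently storing a guess.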

Data Quality Issue #2: Lack of Diverse Data

Ignoring the formatting issues, do you notice anything strange about this dataset? Why would someone wear a short sleeve shirt and shorts in January when it is freezing outside? Take a moment and try to brainstorm possible explanations before you continue reading. If you are struggling with ideas, I would suggest questioning each assumption you made about the dataset.






Here are 5 possible explanations I could come up with for why someone would wear shorts and a t-shirt on a cold day in January:

  1. Temperature was recorded incorrectly by the user.

  2. The clothing worn was recorded incorrectly by the user.

  3. The user recorded the temperature in Celsius when it should be Fahrenheit. 22 degrees Celsius converts to 71.6 degrees Fahrenheit.

  4. The person never went outside, so they never had to endure the cold temperatures.

  5. The person only went outside to exercise. While you are running, your body temperature increases.
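The unit mix-up in explanation 3 is easy to check with the standard conversion formula. A small sketch (the function name is arbitrary):

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# 22 recorded as Celsius is a mild day, not a freezing one.
print(f"{celsius_to_fahrenheit(22):.1f}")  # prints 71.6
```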

After digging into this problem further, imagine that we learned this person works from home some days. The dataset above excludes this important piece of information. Without knowing whether the person is staying home or going to work, the dataset is likely not informative enough. Take a look at the temperature and clothes worn each day. The person seems to wear a short sleeve shirt and shorts on most days, regardless of the temperature. If you feed a model this data, it will likely predict a short sleeve shirt and shorts in both winter and summer because of the lack of differentiation across records.
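One quick way to spot this lack of differentiation is to look at how often each outfit appears. The sketch below uses made-up labels, not the values from the table: if a single outfit dominates, a model can score well by always predicting it and still learn nothing about the weather.

```python
from collections import Counter

# Hypothetical target column: 8 of 10 days have the same outfit.
outfits = ["t-shirt and shorts"] * 8 + ["coat and jeans"] * 2


def majority_baseline(labels):
    """Return the most common label and the share of records it covers."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


label, share = majority_baseline(outfits)
print(f"Always predicting '{label}' is right {share:.0%} of the time")
```

A trained model is only useful if it clearly beats this do-nothing baseline, which it cannot do when the features that explain the exceptions (such as working from home) are missing.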

Data Quality Issue #3: Non-representative Data Set

A dataset must sample enough of the entire population to reduce bias. In the table above, there is no information captured in February, July, August, September, November, and December. There could be important information about how this person dresses during these months that is different than the other months. If this person is a teacher, they might have July and August off and dress in a swim suit because they go to the beach those months. The omission of these months adds bias to the model and limits its accuracy.
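Coverage gaps like these can be detected before any model is trained. A minimal sketch, assuming each record carries a full date; the specific dates are invented, and only the set of months they cover mirrors the description above.

```python
import calendar
from datetime import date

# Hypothetical record dates covering the same months as the table.
record_dates = [
    date(2022, 1, 3), date(2022, 1, 17), date(2022, 3, 9),
    date(2022, 4, 22), date(2022, 5, 5), date(2022, 6, 30),
    date(2022, 10, 11),
]


def missing_months(dates):
    """Return the names of calendar months with no records at all."""
    present = {d.month for d in dates}
    return [calendar.month_name[m] for m in range(1, 13) if m not in present]


print(missing_months(record_dates))
```

Month names come from the standard calendar module, so the output depends on the active locale; the default is English.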

Other Data Quality Issues

While I chose to focus on these 3 data quality issues, the dataset had many more. Below is a list of other data quality problems, or potential problems, present within the dataset.

  • Each column should only have one piece of information (e.g. month rather than full date)
  • The year that the date represents should be captured
  • The dataset is too small to create a decent model
  • It is unclear what time of day any weather columns represent
  • It is unclear what time of day the clothing worn represents
  • Column labels should not be included in the values, but rather in column headers or in another document (e.g. data definition document)
  • Categorical data needs to be converted to numeric data before model training
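As an illustration of the last point, one common way to turn categorical values into numbers is one-hot encoding. This is a hand-rolled sketch with made-up values; in practice a data library would usually handle this step.

```python
def one_hot(values):
    """Encode each categorical value as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical clothing column.
categories, encoded = one_hot(["shorts", "jeans", "shorts"])
print(categories)
print(encoded)
```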

How Often Do These Situations Occur?

Data quality issues are always something you need to consider when analyzing data or making data-driven decisions. In a recent project at work, I had to explain to stakeholders that the machine learning model I developed cannot improve without changes to the data. There was not enough diversity of information for the patterns to yield a higher level of accuracy.

Because our society is becoming increasingly data-driven, data quality will continue to be a crucial branch of data science. Before you kick off development for the next data-intensive application, I suggest checking the data’s quality or seeking the opinion of data professionals who work with it every day.

Also published here by The Data Generalist

* A model that draws the wrong conclusions is worse than a clearly useless model that gets ignored.

Sometimes developing a proof of concept is necessary to convince stakeholders that a larger effort must be undertaken to improve data quality. This situation is one reason why iterative improvement (e.g. agile) is often touted as the best practice in software development.

Image source: Stable Diffusion AI Demo