Handling Data Integrity Issues Like a Pro

If you’re working on an application that requires data, either housed by yourself or referenced from an external source, you’re eventually going to run into some scary situations. Over the course of my career, I’ve found several recurring issues in data sources that I work with and have had to find workarounds for them. Here I’ll provide some of the common “gotchas” I’ve found in working with data, and the solutions I’ve found to be the most helpful in dealing with them.

Note: this is not meant to be a definitive guide to everything that can go wrong, but merely some things that I have found to be surprisingly common

Bad Integrity

In order for your application to be trusted and valued, the data needs to be trusted and valued as well. Your application is only as good as the data it is providing. If you’re receiving bad data it’ll be on your (and your app) to fix it so that you’re providing something that’s valuable and actionable to your end-users.

So what can make the data bad, and how can we make it good? Data can be bad in many ways, but the most common ones I have found are inconsistencies in column-formats and requirements, and plain-and-simple missing info.

Inconsistencies

If the data is inconsistent it makes it hard to trust and even harder to understand. Regardless of your data source, let’s assume that we are at least receiving the data in some sort of columnar format. Maybe we’re consuming a JSON feed from an API, or uploading an Excel file. Either way, the data is coming in as an array or object and we are storing that into our own database for the application to use later. If we don’t have ultimate control over the system we are requesting data from then we have to perform all of our checks on the data as it comes in.

When I call data inconsistent, I mean that you’re never guaranteed that each row will look similar in format to the last. I’m not referring to the data source being inconsistent or unreliable, that is a different issue altogether. Here we are assuming you know you’ll get a response, you just can’t trust that response will be valuable to you. You may have more or fewer columns than expected. Some columns might be strings in some rows, integers in others. Some columns may contain relationships for external systems, while others seem like they should do not. These types of inconsistencies in the data make it difficult (but not impossible) to work with. So how do we fix this?

Well, we can’t just magically FIX data, but we can choose how we handle it. There are two ways I approach inconsistent data.

Contracts

Attempt to establish a contract with the data provider. If this is an internal team, meet with them and describe your data needs. Get them to put in writing that they will provide you with the data you need and agree to an SLA. This will protect you and your team by making sure that they are agreeing to fix any data inconsistencies that occur on their end before sending them to you. This is clearly the best route for many reasons. First, it keeps everyone on the same page.

Having an SLA guarantees that both teams are satisfied with the data being provided and will ensure it stays that way. It also gives you a chance to define exactly what the data you will receive should look like, instead of having to constantly guess-and-check during development. Last, collaboration is always key. Working directly with the team that owns the data may prove to open more doors for automation in the future.

Retry and Notify

A less elegant but effective approach is to ensure your system is smart enough to handle the inconsistent data. Using an observer pattern you can fire events stating what your application is doing and tracking that status somewhere. You can log that the system requested a resource or a file was uploaded. If the data is found to be inconsistent in your verification processes you can refuse to parse it and retry later. This might mean 5 minutes later, this might mean a day later, that depends on your data needs.

After a certain number of tries and fails, you can then notify the admins of the app that there is an issue with the data. This allows the team to investigate the data source to fix the issue. This situation isn’t ideal, but it does solve some common edge-cases like when a server fails to provide data but still sends a 200-OK response, or when an Excel file being consumed from a shared drive is corrupt due to an upstream issue but is replaced by a valid file automatically. Last, it keeps the admins informed and allows them to provide the necessary action in a timely manner.

Missing Information

More often than I would like to admit, I’ve been told “this column is absolutely critical. it will never be empty” only to find that 20% of the example file provided for upload tests contains blanks in that field. That field is critical for many of the reports in the system to function, the database has the field marked as “not null,” we simply cannot consume the file in this state in its entirety. So what do we do? Throw the whole thing away and tell them to fix it? Ideally, yes, because your application is only a good as the data it’s holding, but that’s almost never going to be acceptable. Instead, we have to find a way to work through this.

No Contract, No Deal

First, if you don’t have some sort of SLA or guarantee from the data provider of the integrity of the incoming data, your best bet is to assume that every single field can possibly be blank and make sure you can handle those situations. Being able to consume the data no matter how it looks will allow your imports to work flawlessly but your reporting side will have to make accommodations to group the invalid data together to ensure it still gets reported to your users somehow.

Total Control

If these issues are few and far between, and I hope they are, you can provide the admins of your application with an override of the data so that they can provide the missing data for you. There are situations, of course, where this might actually be a valid situation. Consider a situation where you are stocking store shelves and a new product is ordered. Your system receives an update stating that Sugar Crunch Cereal will be stocked, but doesn’t contain a stocking amount.

Maybe the store manager decided he wanted to sell this new product, but hasn’t decided how much he will order yet. When the system catches this record with the missing stock information that is listed as vital and not nullable, it can be stored in a separate staging table. At this point, you can notify the admin, and they can log in and update the missing information when they have it.

Once edited, that record can be moved back into your regular table where it belongs. Obviously, if we’re talking millions of records this would never work, but in situations where the failure is rare and sometimes expected, this is a workable solution.

Lookups

The last thing you can consider as a hail-mary to fix missing data is to see if you can look up the missing information from another source. Consider an example situation: you are creating a sales and product report for a grocery store chain. They deliver their product and sales info to you but sometimes the nutritional information is missing from the product lines.

You want to deliver a report to them that shows whether high-calorie or low-calorie foods tend to sell better, so you need this nutritional data. Often, this data is available from a multitude of other sources so, if you know enough of the product information to guarantee you have the right product you could actually look this up from another source.

In best-case scenario, they have an internal service inside their company that you can use so you know that you’re getting information that they trust and have verified. Using an external source is always tricky and you’d have to use your best judgment as to whether or not you’d want to take a route that risky, but if you can find a trustworthy source (and I reiterate, look internally first) then this is a valuable backup-plan to have at the ready.

These are just a few of the major issues you can run into when developing a data-driven application but they are, from my perspective, some of the most common issues. I hope someone will find this helpful before starting their next application. Remember, you can’t always control the data you import into your application but you control everything once it enters your domain. And because of this, data integrity is your responsibility. Do what’s right for your app and your customers.