Good Data is in the Blood of Trusted Applications

by Jacob Landry, December 4th, 2021
Let’s face it. Most of our applications are driven by data. Either they’re consuming it, creating it, reporting on it, or all of the above.


Unfortunately, your app is really only as good as the data that supports it and most of the time, that can be out of your control.


How should you be handling the multitude of situations where you are dependent on external sources to provide data? What can you do to ensure that your system is protected?

Determining Dependencies

If you’re not uploading or generating the data yourself, you’re making your application reliant on other people or applications. This is a dependency.



A dependency can be explained as simply as “any situation where you rely on someone other than your application to provide data.” This can be a manual upload, an automated API call, a database transfer from an external source, or even relying on an external data source altogether.


These dependencies create risks that need to be mitigated if you’re going to be able to sleep at night.


There are hundreds of reasons you might have a data dependency in your application. I'm going to list a few that have been the most common for me, along with what I've done to feel confident my application is safe from any potential data loss due to these dependencies.

Timing and Scheduling

Are you requesting data at a specific time? Or multiple times? Does this data need to be consumed at that exact time, or does it lose its value? These rigid requirements aren't uncommon, but they create a fair amount of risk because you are reliant on two major factors.

Stay Online

First, your system has to be up and running at the time that you need to request or receive data. Server outages happen; none of us are perfect. If your application isn't online to consume the data when it needs to be, there could be a problem.


This can be mitigated in a couple of different ways, the first of which is simple: don't let your app fail. I don't mean never make mistakes; I mean that if your app needs to be online at specific times, it has to have 99%+ uptime.


This means employing cloud-like solutions and containerized environments where a new version of your application can be automatically started if another one fails.


This also means you can have multiple versions of the application running at any given time.


Typically in this situation, I would actually have a Kubernetes pod dedicated strictly to consuming the scheduled resources. It’s not taking any other requests and runs a very low risk of ever failing. Even if the pod servicing the web front-end crashes, my scheduler remains up.
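As a rough sketch, that dedicated scheduler could be its own Kubernetes Deployment, separate from the web front-end. All names and the entrypoint here are illustrative, not from any real system:

```yaml
# Hypothetical Deployment running only the scheduled-consumption worker.
# Kept separate from the web front-end so a front-end crash can't take it
# down; Kubernetes restarts the pod automatically if it fails.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-scheduler            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: data-scheduler
  template:
    metadata:
      labels:
        app: data-scheduler
    spec:
      containers:
        - name: scheduler
          image: myapp:latest                       # assumed image
          command: ["python", "run_scheduler.py"]   # assumed entrypoint
```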

Seriously, Stay Online


Second, the system or person you are relying on to provide the data needs to be available and ready at the time you need to request it. Again, server outages happen (or alarm clocks fail).


There’s always a possibility that the data will not be sent or provided at the time you expect it. You don’t have control over external sources or people, but you can control how your system behaves.


You can notify people or systems if they miss a schedule, or you can continue to check for the data to be provided for a certain time after a missed schedule.


For example, you might expect to receive data every Tuesday at 9 AM; in the event that you don't, you check once an hour for the rest of the day until the data arrives.
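That fallback can be sketched in a few lines. Everything here is a placeholder: `fetch` stands in for whatever pull your application does, and the retry count and interval are the hourly-for-a-day example above:

```python
import time

def consume_with_fallback(fetch, max_retries=8, retry_interval_s=3600,
                          sleep=time.sleep):
    """Try the scheduled pull; on a miss, re-check periodically before giving up.

    `fetch` is any callable returning the payload, or None when nothing is
    ready yet. `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(max_retries + 1):
        data = fetch()
        if data is not None:
            return data
        if attempt < max_retries:
            sleep(retry_interval_s)  # wait until the next hourly check
    # The schedule was missed entirely: surface it loudly.
    raise RuntimeError("Data was never provided; alert the upstream team")
```

Raising at the end (rather than silently returning nothing) is the "notify people or systems if they miss a schedule" step: whatever monitors your app sees the failure.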

Upstream Sources

Similar to my last point above, any time you're relying on external sources that are out of your control, there's a lot of risk. This gets worse when you have many interconnected systems working together toward an end goal, none of which actually cares about the applications below it.


For example, consider an application that collects sales data from point-of-sale systems in-store. That data might then be consumed and summarized by a finance team assessing the sales. Months later, that summarized data might be consumed by a second team to provide a sales forecast for the upcoming quarter.


The point-of-sale system has absolutely no knowledge of the forecasting system and likely never will. If the point-of-sale system fails, the forecast system suffers but the team handling the point-of-sale system is unlikely to know (or care).


This can get more complicated when situations combine, like the scheduling issues above. What if the point-of-sale system goes offline for an update at the same time that the finance team wants to pull its summarized data? The summary isn't populated, so the data isn't present when the forecasting team tries to pull it. The data will be provided once the update is complete, but it isn't ready when the application expected it.


The larger your company is, the more likely it is that your data, whether you know it or not, is based on several upstream sources, all of which could fail or become unreliable at any given time. There's really no way to solve this from an individual contributor's perspective, but there are definitely steps you can take to help, or at least to protect your application.


The most important thing is to fully understand your data. Know the data and its sources so you can learn of these upstream systems. Reach out to the teams so you can be notified of outages, allowing you to be proactive in your own system management.


As you find issues, protect against them with various retry loops or manual overrides. Just be sure to clean out any data you need to remove from a previous attempt before retrying.
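One minimal sketch of that retry-with-cleanup pattern, where `pull`, `cleanup`, and `store` are placeholders for your own fetch, partial-data removal, and persistence logic:

```python
def import_with_retries(pull, cleanup, store, max_attempts=3):
    """Attempt an import; on failure, clean out partial data before retrying."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = pull()
            store(data)
            return data
        except Exception as err:  # upstream failed or sent bad data
            last_error = err
            cleanup()             # remove anything the failed attempt wrote
    raise RuntimeError(
        f"Import failed after {max_attempts} attempts"
    ) from last_error
```

The key detail is that `cleanup()` runs before every retry, so a half-written attempt never pollutes the next one.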

Live Updated Data

Live data is good, but it can also create confusion. Normally when you import data into your own tool, you’re going to want a “snapshot” of the live data at that given time. If it takes you several minutes to import your data and some of the records are being actively updated or changed while you import the data, it’s possible that things can become skewed.


Ensuring you are pulling in your live data in the correct order is critical to avoid these issues.

Consider the following situation: You’re pulling a list of vendors from an online store to build sales reports. Vendors can be added or removed at any given time as people sign up to sell their goods on a website.


Now consider you're importing millions of vendors into your tool for reporting, and you do this in pages of 500k records at a time to save yourself some memory woes. You're on page 3, importing data, and you request page 4, only to find out that a vendor, Abignail Twist, was added mid-import to page 2, causing all the records to shift.


Now page 4's first record is the same as page 3's last record, and suddenly you have a duplicate. It's easy to remove duplicates, but we shouldn't have to. Worse, if our final page was meant to have 499 records and we add 5 more vendors mid-process, that creates a new page we were not intending to parse (assuming we collected the number of pages we needed prior to pulling data).


This could cause us to miss several records of data.

Work Smarter, not Harder


First, most intelligently designed systems are going to have an auto-incrementing ID for each record.


Sort the data you're requesting by that ID, ascending. This ensures that any new records created mid-import are added to the end of the last page, so you won't have any duplicate or missed records.
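Taken one step further, you can drop page numbers entirely and key each request off the last ID you saw (sometimes called keyset pagination), which makes shifting pages impossible. The `fetch_page` signature here is an assumption standing in for your real API or query:

```python
def import_all(fetch_page, page_size=500_000):
    """Pull every record by ascending ID so mid-import inserts can't shift pages.

    `fetch_page(after_id, limit)` is a placeholder returning up to `limit`
    records with id > after_id, ordered by id ascending.
    """
    records = []
    last_id = 0
    while True:
        page = fetch_page(after_id=last_id, limit=page_size)
        if not page:
            break  # nothing left beyond the last ID we saw
        records.extend(page)
        last_id = page[-1]["id"]  # next request starts after this ID
    return records
```

Records added mid-import get higher IDs, so they either land in a later request of this same run or are safely picked up by the next refresh; nothing is duplicated or skipped.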

Snapshots

It's also a good idea to look for any type of column you can use to make sure you're getting the correct data. For example, if the data has an updated_at datetime column, you can request all records whose updated_at is earlier than the moment you started the import, fixing that cutoff once rather than re-evaluating "now" on every page.


That means any new records that are added or changed while the import is processing will be ignored and can be consumed in your next refresh.
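A sketch of that snapshot cutoff, assuming a hypothetical `vendors` table with an `updated_at` column and a driver using `%s`-style placeholders:

```python
from datetime import datetime, timezone

def snapshot_query(import_started_at):
    """Build a query that ignores rows changed after the import began.

    The cutoff is passed in, not computed here, so every page of a
    multi-page import filters against the same snapshot boundary.
    """
    return (
        "SELECT * FROM vendors WHERE updated_at < %s ORDER BY id ASC",
        (import_started_at,),
    )

cutoff = datetime.now(timezone.utc)  # capture once, before the first page
query, params = snapshot_query(cutoff)
```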

Soft Deletes

You can also check for soft-delete functionality. Perhaps if a vendor is deleted from the system they aren’t actually removed but only marked as “deleted” by some sort of deleted_at timestamp column.


In this case, you can request anything that has a null deleted_at value, or one that is greater than the time you started the request, ensuring you also correctly consume records that are soft-deleted mid-pull.
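That filter might look like the following, again assuming a hypothetical `vendors` table with a `deleted_at` timestamp column and a cutoff captured when the pull began:

```python
def soft_delete_filter(pull_started_at):
    """Query keeping rows that were still live when the pull began.

    A row counts as live if it was never deleted (deleted_at IS NULL) or
    was deleted only after the pull started (deleted_at > cutoff).
    """
    return (
        "SELECT * FROM vendors "
        "WHERE deleted_at IS NULL OR deleted_at > %s "
        "ORDER BY id ASC",
        (pull_started_at,),
    )
```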

Summary

These are just some of the many situations you can run into when working with external data sources. Be aware of what you’re pulling and protect your system.


In the end, that's the best you can do for your application and your customers.