
How can a shared Slack channel improve your data quality?

by Taavi Rehemägi, May 25th, 2021

Too Long; Didn't Read

The simplest way to improve data ownership is to assign owners to the most critical data artifacts, i.e., specific tables in a data warehouse, data lake datasets, and data science models. As social creatures, we are more motivated to fix issues if other people can see the effort we put into it. The process can then evolve by adding automation and monitoring dashboards for more visibility. The important part of the process is to send those alerts to a Slack channel which is shared across the data team.


Have you ever heard anyone say: “Our data is great, we’ve never had any data quality issues”? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvements. But in reality, the simplest and most intuitive solutions are often the most impactful. In this article, we’ll look at one idea to improve the process around data quality and make it more rewarding and actionable.

Often the most impactful changes come from rethinking our processes.

Taking ownership of data

Regardless of how your data team is structured (centralized BI/Data Team vs. decentralized domain-oriented teams leveraging Data Mesh paradigm), people need to take ownership to make any lasting and impactful change. If nobody feels responsible for fixing an issue in the data, we shouldn’t expect that the situation will get better, regardless of the tools we use. 

How can we approach this?

The simplest way to improve data ownership is to assign owners to the most critical data artifacts, i.e., specific tables in a data warehouse, data lake datasets, and data science models. It’s not that we want to blame those people for data quality issues. Instead, assigning owners can create more transparency over who should look after specific data assets and do whatever they can to ensure that this data remains clean. The process can then evolve by adding automation and monitoring dashboards for more visibility.

In a nutshell, before taking any tools or automation scripts into consideration, it’s helpful to first think about a process of establishing data ownership.

Making the process more rewarding and easier to track

Once the ownership is defined, we can improve the process by making quality checks more rewarding and automated. Adding simple scripts that execute data quality checks and notify data owners via a shared Slack channel about any data quality issues can be highly effective to increase the team’s engagement in improving data quality.

The important part of the process is to send those alerts to a Slack channel which is shared across the data team. As social creatures, we are more motivated to fix issues if other people can see the effort we put into it. For instance, the data owner who took care of the issue can:

- send a reply explaining what the root cause was and which steps were taken to solve the problem,

- simply add a checkmark to show that the issue has been taken care of,

- or add a ticket link if the issue turned out to be much more complex and needs to be put into a backlog.

An example of a shared Data Quality Slack channel with users engaged in the process 

All of the above actions add visibility and (social) proof that data quality issues are no longer ignored. This demonstrates how taking ownership and making the process more socially rewarding can already yield tangible improvements.

Leveraging automation to facilitate the process

Let’s assume that we established the process and agreed on data ownership. How can we go about implementing those automated data quality alerts? The process can be as simple as:

- building SQL queries that check for anomalies in the data,

- writing a script that sends a Slack notification if the alert condition is met,

- creating a shared Slack channel and a webhook to send messages to it.

First, to create a webhook, go to https://api.slack.com/apps → Create an App.

Add a name to your app and select your desired Slack workspace.

Select incoming webhooks and create one for the Slack channel of your choice (“Add New Webhook to Workspace”).

Once all that’s done, you can copy your Webhook URL and use it in your Python script. Note that you should treat this webhook URL in the same way as an API key or password.

The script to build the alert is as simple as sending a POST request to the Slack API endpoint represented by the webhook (line 19 in the Gist below). 
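That gist is embedded in the original post and not reproduced here; as a minimal sketch of the idea, assuming the requests library and using placeholder values for the webhook URL and message text, it could look like this:

```python
import requests

def send_slack_alert(message: str, hook_url: str) -> None:
    """Post a plain-text message to the Slack channel behind the incoming webhook."""
    response = requests.post(hook_url, json={"text": message})
    response.raise_for_status()  # fail loudly if Slack rejects the message

# Example usage with placeholder values:
send_slack_alert(":warning: 3 orders have an unexpected order_status", hook_url="YOUR_HOOK_URL")
```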

Note that on line 35, the Webhook URL is retrieved from AWS Secrets Manager. If you want to follow the same approach to store this confidential piece of information, make sure to add it to your set of secrets:

aws secretsmanager create-secret --name slack-webhook --secret-string '{"hook_url": "YOUR_HOOK_URL"}'
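Reading the secret back in Python could then look roughly like this (a sketch using boto3 and the secret name and key from the command above):

```python
import json
import boto3

secrets_client = boto3.client("secretsmanager")

def get_slack_webhook_url() -> str:
    """Fetch the webhook URL stored under the 'slack-webhook' secret."""
    secret = secrets_client.get_secret_value(SecretId="slack-webhook")
    return json.loads(secret["SecretString"])["hook_url"]
```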

About the checks from the example

In this code example, we are checking whether order status and payment type match the expected (allowed) values. If not, we should receive a Slack message informing us about the outliers:
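The full check lives in the gist embedded in the original post; as a rough sketch of the same idea, assuming the tables are queried through Athena via the awswrangler library, with a hypothetical ecommerce database and placeholder allowed values:

```python
import awswrangler as wr

# Placeholder value sets -- adjust to whatever your dataset actually allows.
ALLOWED_ORDER_STATUSES = ("delivered", "shipped", "canceled", "invoiced")
ALLOWED_PAYMENT_TYPES = ("credit_card", "boleto", "voucher", "debit_card")

def find_outliers(table: str, column: str, allowed: tuple) -> list:
    """Return the distinct values in `column` that are not in the allowed set."""
    values = ", ".join(f"'{v}'" for v in allowed)
    sql = f"SELECT DISTINCT {column} FROM {table} WHERE {column} NOT IN ({values})"
    df = wr.athena.read_sql_query(sql, database="ecommerce")  # hypothetical database name
    return df[column].tolist()

unexpected_statuses = find_outliers("orders", "order_status", ALLOWED_ORDER_STATUSES)
unexpected_payments = find_outliers("order_payments", "payment_type", ALLOWED_PAYMENT_TYPES)
```

If either list is non-empty, the outliers can then be formatted into a message and passed to send_slack_alert.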

Obviously, those checks represent rather contrived examples (based on this e-commerce dataset from Kaggle). In a real-world scenario, your data quality checks may validate:

- whether a specific KPI in your data reaches some critical value or surpasses its expected range,

- the occurrence of highly improbable values (a B2C customer buying hundreds of items of the same product),

- whether some values (for instance, marketing, payment, or logistics costs) significantly deviate from planned values,

- whether data is up-to-date, complete, duplicate-free, and without missing values,

- …and many more.
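As an illustration of the “up-to-date” check from the list above, a freshness check might look roughly like this (again assuming Athena via awswrangler, with hypothetical table and column names):

```python
import awswrangler as wr

def orders_are_fresh(max_lag_hours: int = 24) -> bool:
    """Check that the newest order timestamp is no older than `max_lag_hours`."""
    sql = """
        SELECT date_diff('hour', max(order_purchase_timestamp), current_timestamp) AS lag_hours
        FROM orders
    """
    df = wr.athena.read_sql_query(sql, database="ecommerce")  # hypothetical database name
    return int(df["lag_hours"].iloc[0]) <= max_lag_hours
```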

Deploying the scripts to AWS

To make running those periodic checks more scalable, we could leverage AWS Lambda. To make the previously shown GitHub gist work with Lambda, we need to wrap our main execution code into a lambda handler (starting from line 34). Also, we need to ensure our logger is defined globally in a way that is compliant with AWS Lambda.
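The handler itself is in the gist; a rough sketch of such a wrapper, where run_data_quality_checks stands in for a hypothetical helper wrapping the SQL checks and send_slack_alert and get_slack_webhook_url are the sketches shown earlier, might look like this:

```python
import logging

# Defined at module level so it is configured once per Lambda execution environment.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda on every scheduled run."""
    issues = run_data_quality_checks()  # hypothetical helper returning a list of issue descriptions
    if issues:
        logger.info("Found %s data quality issues, alerting Slack", len(issues))
        send_slack_alert("\n".join(issues), hook_url=get_slack_webhook_url())
    return {"issues_found": len(issues)}
```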

The full project is available in this GitHub repository.

To deploy our container image to AWS, we build and push our container image to ECR (123456 is a placeholder for AWS Account ID). 
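The exact commands depend on your region and repository name; a typical sequence (with eu-central-1 and the repository name data-quality-checks as placeholder assumptions, and 123456 standing in for the account ID) looks roughly like this:

```bash
# Create the repository once, then log in, build, tag, and push the image.
aws ecr create-repository --repository-name data-quality-checks
aws ecr get-login-password --region eu-central-1 | \
  docker login --username AWS --password-stdin 123456.dkr.ecr.eu-central-1.amazonaws.com
docker build -t data-quality-checks .
docker tag data-quality-checks:latest 123456.dkr.ecr.eu-central-1.amazonaws.com/data-quality-checks:latest
docker push 123456.dkr.ecr.eu-central-1.amazonaws.com/data-quality-checks:latest
```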

Then, in the Lambda configuration, we select our desired container image, as shown below. 

Since performing database queries can be time-consuming, we need to increase the timeout setting. Also, increasing the memory size to at least 256 MB seems reasonable, since the data returned by the queries can take up a fair amount of memory.

Make sure to add relevant IAM policies. For this example, we need Athena and S3 permissions.

Lastly, to ensure that our checks run on schedule, we need to add a CloudWatch schedule as a trigger:

Additionally, we can test the function using an empty JSON payload:

How can we further improve the automated process?

So far, we established a semi-automated process ensuring that the owners are notified about data quality issues occurring in their data assets. We also started building automated checks to regularly validate our data. Now we can start thinking about how to run this at scale. While AWS Lambda provides highly scalable compute resources, it can be difficult to track and fix errors occurring in your serverless functions.

One possible solution is to leverage a serverless observability platform such as Dashbird. The initial onboarding is as simple as clicking a single button to deploy a CloudFormation template within your AWS account. Once the CloudFormation stack has been created, you can immediately start using the dashboards, configure alerts on failure, and dive into beautifully formatted log messages.

Dashbird observability platform demonstrating a cold start in AWS Lambda 

Drawbacks of the presented approach

The first problem with the demonstrated approach is that we need some stateful logic to ensure that we don’t notify about the same problems too frequently; otherwise, people will start ignoring the alerts and will possibly mute the Slack channel. Also, the social aspect might get lost if there are too many messages.
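One way to add that stateful layer (not part of the original example) is to record when each check last alerted, for instance in a small DynamoDB table, and suppress repeats within a cooldown window. A rough sketch, with the table name dq-alert-state and the six-hour cooldown as assumptions:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
ALERT_COOLDOWN_SECONDS = 6 * 3600  # don't repeat the same alert within six hours

def should_alert(check_name: str) -> bool:
    """Record the alert and return False if this check already fired within the cooldown."""
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName="dq-alert-state",  # hypothetical table keyed by check_name
            Item={"check_name": {"S": check_name}, "alerted_at": {"N": str(now)}},
            ConditionExpression="attribute_not_exists(check_name) OR alerted_at < :cutoff",
            ExpressionAttributeValues={":cutoff": {"N": str(now - ALERT_COOLDOWN_SECONDS)}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # alerted recently -- stay quiet to avoid alert fatigue
        raise
```

Each check would then call should_alert before posting to Slack.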

Additionally, writing all sorts of data quality checks by ourselves is not scalable, and possibly not even feasible if you deal with vast amounts of data. As Barr Moses points out, such quality checks can only cover the known unknowns, i.e. problems that can be anticipated. An interesting idea would be to combine the social aspect of a shared Slack channel with alerts from end-to-end observability pipelines.

Conclusion

In this article, we looked at how simple process adjustments can increase the team’s engagement and commitment to improving data quality. Often the most impactful changes don’t require any big decisions or investments; they come from rethinking our processes, adding automation to enhance their execution, and ensuring that the entire team keeps working together on a common goal of improving data quality.

Previously published on https://dashbird.io/blog/improve-data-quality-slack/