Have you ever heard anyone saying: “Our data is great, we’ve never had any data quality issues”? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvements. But in reality, often the simplest and most intuitive solutions can be incredibly impactful. In this article, we’ll look at one idea to improve the process around data quality and make it more rewarding and actionable.

Often the most impactful changes come from rethinking our processes.

Taking ownership of data

Regardless of how your data team is structured (a centralized BI/Data Team vs. decentralized domain-oriented teams leveraging the Data Mesh paradigm), people need to take ownership to make any lasting and impactful change. If nobody feels responsible for fixing an issue in the data, we shouldn’t expect the situation to get better, regardless of the tools we use.

How can we approach this? The simplest way to improve data ownership is to assign owners to the most critical data artifacts, i.e., specific tables in a data warehouse, data lake datasets, and data science models. It’s not that we want to blame those people for data quality issues. Instead, assigning owners can create more transparency over who should look after specific data assets and do whatever they can to ensure that this data remains clean. The process can then evolve by adding automation and monitoring dashboards for more visibility.

In a nutshell, before taking any tools or automation scripts into consideration, it’s helpful to first think about a process of establishing data ownership.

Making the process more rewarding and easier to track

Once the ownership is defined, we can improve the process by making quality checks more rewarding and automated. Adding simple scripts that execute data quality checks and notify data owners via a shared Slack channel about any data quality issues can be highly effective in increasing the team’s engagement in improving data quality.

The important part of the process is to send those alerts to a Slack channel which is shared across the data team. As social creatures, we are more motivated to fix issues if other people can see the effort we put into it. For instance, the data owner who took care of the issue can:

- send a reply explaining what the root cause was and which steps have been taken to solve the problem,
- simply add a checkmark to show that this issue has been taken care of,
- or add a ticket link if the issue turned out to be much more complex and needs to be put into a backlog.

An example of a shared Data Quality Slack channel with users engaged in the process

All of the above actions add visibility and (social) proof that data quality issues are no longer ignored. It demonstrates how taking ownership and making the process more socially rewarding can already yield tangible improvements.

Leveraging automation to facilitate the process

Let’s assume that we established the process and agreed on data ownership. How can we go about implementing those automated data quality alerts? The process can be as simple as:

- building SQL queries that check for anomalies in the data,
- writing a script that sends a Slack notification if the alert condition is met,
- creating a shared Slack channel and a webhook to send messages to it.

First, to create a webhook, go to https://api.slack.com/apps → Create an App. Add a name to your app and select your desired Slack workspace. Select incoming webhooks and create one for the Slack channel of your choice (“Add New Webhook to Workspace”). Once all that’s done, you can copy your Webhook URL and use it in your Python script. Note that you should treat this webhook in the same way as an API key or password.

The script to build the alert is as simple as sending a POST request to the Slack API endpoint represented by the webhook (line 19 in the Gist below). Note that on line 35, the Webhook URL is retrieved from AWS Secrets Manager. If you want to follow the same approach to store this confidential piece of information, make sure to add it to your set of secrets:

aws secretsmanager create-secret --name slack-webhook --secret-string '{"hook_url": "YOUR_HOOK_URL"}'

About the checks from the example

In this code example, we are checking whether order status and payment type match the expected allowed values. If not, we should receive a Slack message informing us about the outliers:
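As an illustration of what such a script could look like (a minimal sketch, not the exact Gist referenced above), the example below assumes the orders live in an Athena table — the ecommerce database, the orders table, and the allowed-value sets are placeholders — uses the awswrangler library to run the queries, reads the webhook URL from the slack-webhook secret created above, and posts the alert to Slack with a plain POST request:

```python
import json
import logging

import awswrangler as wr
import boto3
import requests

# Logger defined globally, as discussed in the deployment section below
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Placeholder names — adjust to your own Athena database and table
GLUE_DATABASE = "ecommerce"
ORDERS_TABLE = "orders"

# Illustrative allowed values — replace with the values expected in your data
EXPECTED_ORDER_STATUSES = {"created", "approved", "shipped", "delivered", "canceled"}
EXPECTED_PAYMENT_TYPES = {"credit_card", "debit_card", "voucher", "boleto"}


def get_slack_webhook_url(secret_name: str = "slack-webhook") -> str:
    """Retrieve the webhook URL stored in AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId=secret_name)
    return json.loads(secret["SecretString"])["hook_url"]


def send_slack_alert(message: str) -> None:
    """Send a POST request to the Slack incoming-webhook endpoint."""
    response = requests.post(get_slack_webhook_url(), json={"text": message})
    response.raise_for_status()


def check_allowed_values(column: str, allowed_values: set) -> None:
    """Alert the shared Slack channel if a column contains unexpected values."""
    sql = f"SELECT DISTINCT {column} FROM {ORDERS_TABLE}"
    df = wr.athena.read_sql_query(sql, database=GLUE_DATABASE)
    outliers = set(df[column].dropna()) - allowed_values
    if outliers:
        send_slack_alert(f":warning: Unexpected values in `{column}`: {sorted(outliers)}")
    else:
        logger.info("Column %s passed the data quality check", column)


if __name__ == "__main__":
    # Basic logging config for local runs; Lambda provides its own handler
    logging.basicConfig(level=logging.INFO)
    check_allowed_values("order_status", EXPECTED_ORDER_STATUSES)
    check_allowed_values("payment_type", EXPECTED_PAYMENT_TYPES)
```

Running a sketch like this locally requires AWS credentials with Athena and S3 access, which is why those IAM permissions come up again in the deployment section below.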
Obviously, those checks represent rather contrived examples (based on this e-commerce dataset from Kaggle). In a real-world scenario, your data quality checks may validate:

- whether a specific KPI in your data reaches some critical value or surpasses the expected range of values,
- the occurrence of highly improbable values (a B2C customer buying hundreds of items of the same product),
- whether some values (for instance, marketing, payment, or logistic costs) significantly deviate from planned values,
- whether data is up-to-date, complete, duplicate-free, and without missing values,
- …and many more.

Deploying the scripts to AWS

To make running those periodic checks more scalable, we could leverage AWS Lambda. To make the previously shown Github gist work with Lambda, we need to wrap our main execution code into a lambda handler (starting from line 34). Also, we need to ensure our logger is defined globally in a way that is compliant with AWS Lambda.
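A minimal sketch of that wrapping could look like the following (the exact code lives in the repository linked below); run_data_quality_checks is a placeholder for the checks shown earlier, and the logger is created at module level so it works with the logging already configured by the Lambda runtime:

```python
import logging

# Define the logger globally (at module level) so it is configured once per
# Lambda execution environment and plays nicely with the runtime's handler.
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def run_data_quality_checks() -> None:
    # Placeholder for the checks shown earlier, e.g.:
    # check_allowed_values("order_status", EXPECTED_ORDER_STATUSES)
    # check_allowed_values("payment_type", EXPECTED_PAYMENT_TYPES)
    ...


def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda, e.g. by the CloudWatch schedule."""
    logger.info("Received event: %s", event)
    run_data_quality_checks()
    return {"status": "data quality checks finished"}
```

When packaged as a container image, the image’s command then points at this handler (for example, app.lambda_handler), and the CloudWatch schedule described below simply invokes it on a regular basis.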
The full project is available in this Github repository.

To deploy our container image to AWS, we build and push our container image to ECR (123456 is a placeholder for the AWS Account ID).

Then, in the Lambda configuration, we select our desired container image, as shown below.

Since performing database queries can be time-consuming, we need to increase the timeout setting. Also, increasing the memory size to at least 256 MB seems reasonable since the data returned by the queries can take up a larger amount of space in memory.

Make sure to add relevant IAM policies. For this example, we need Athena and S3 permissions.

Lastly, to ensure that our checks run on schedule, we need to add a CloudWatch schedule as a trigger.

Additionally, we can test the function using an empty JSON payload.

How can we further improve the automated process?

So far, we established a semi-automated process ensuring that the owners are notified about data quality issues occurring in their data assets. We also started building automated checks to regularly validate our data. Now we can start thinking about how to run it at scale. While AWS Lambda provides highly scalable compute resources, it makes it difficult to track and fix errors occurring in your serverless functions.

One possible solution is to leverage a serverless observability platform such as Dashbird. The initial onboarding is as simple as clicking a single button to deploy a CloudFormation template within your AWS account. Once the CloudFormation stack is finished, you can start using the dashboards, immediately configure alerts on failure, and dive into beautifully formatted log messages.

Dashbird observability platform demonstrating a cold start in AWS Lambda

Drawbacks of the presented approach

The first problem with the demonstrated approach is that we need some stateful logic to ensure that we don’t notify about the same problems too frequently, otherwise people will start ignoring the alerts and will possibly mute the Slack channel. Also, the social aspect might get lost if there are too many messages.

Additionally, writing all sorts of data quality checks by ourselves is not scalable, and possibly not even feasible if you deal with vast amounts of data. As Barr Moses points out, such quality checks can only cover the known unknowns, i.e. problems that can be anticipated. An interesting idea would be to combine the social aspect of a shared Slack channel with alerts from end-to-end observability pipelines.

Conclusion

In this article, we looked at how simple process adjustments can increase the team’s engagement and commitment to improving data quality. Often the most impactful changes don’t require any big decisions or investments but come from rethinking our processes, adding automation to enhance their execution, and ensuring that the entire team keeps working together on a common goal of improving data quality.

Previously Published on https://dashbird.io/blog/improve-data-quality-slack/