In today's data-driven world, having a resilient data pipeline is crucial for businesses to extract valuable insights and make informed decisions.
Disclaimer: This article includes an affiliate link to Dromo. You won't be charged anything extra for clicking on it, but you will be supporting me if you choose to do so.
While incorporating data validation logic into a data pipeline is a relatively straightforward task for developers, challenges arise when the validation logic needs updating or when validation fails.
Traditionally, organizations have relied on engineers to troubleshoot and fix these issues, leading to inefficiencies and an unsustainable system. The key to a robust data pipeline lies in empowering non-technical team members to maintain data quality and handle bad data.
When data progressing through a pipeline fails a validation check, it is common practice to block it from moving further and to resort to one of the following approaches:
Triggering an error in an application monitoring system: Although this approach alerts engineers to the problem, flagging validation errors in a platform like Sentry does not make them easier or more pleasant to fix. Debugging more traditional performance issues - such as network errors or memory leaks - will almost always take precedence over investigating one-off validation errors.
Logging the error to a server: As in the first example, this approach captures that a validation error has occurred. But now an engineer needs to sift through log files just to find these issues, which tends to happen sporadically at best. Data validation alerts often languish in the depths of log files, when they need to be resolved in real-time.
Writing errors to a report: In this approach, alerts are written to an Excel file or JSON document and then shared directly with a stakeholder via email or Slack. This is the most effective of the three approaches but requires a lot of work to set up and is quite rare. Notably, this approach still necessitates collaboration between the report reader and an engineer to find the original data file, understand the error message, and resubmit the corrected data.
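To make the pattern concrete, here is a minimal sketch in Python of what the "log it and write a report" combination often looks like in practice. The field names and file paths are hypothetical, not taken from any particular system: invalid rows are blocked, a warning goes to a log file, and the rejects are dumped to a JSON report that someone may or may not read.

```python
import json
import logging

logging.basicConfig(filename="pipeline.log", level=logging.WARNING)

# Hypothetical rule: every row must contain these fields.
REQUIRED_FIELDS = {"email", "signup_date"}

def process_rows(rows):
    """Pass valid rows downstream; log and block everything else."""
    valid, rejected = [], []
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            # The row stops here; an engineer later has to dig through
            # pipeline.log to discover why it never arrived.
            logging.warning("Row %d rejected: missing %s", i, sorted(missing))
            rejected.append(row)
        else:
            valid.append(row)
    # The "report": a JSON file shared over email or Slack.
    with open("rejected_rows.json", "w") as f:
        json.dump(rejected, f, indent=2)
    return valid
```

Nothing in this flow tells a non-technical stakeholder what went wrong or gives them a way to correct the data and resubmit it; every rejected row ultimately lands back on an engineer's desk.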
The most common outcome across the board is that bad data is simply blocked for much longer than anyone wants. Engineers are slow to fix bad data because they would rather be building and refining systems and infrastructure, and non-technical stakeholders are slow because they are either not in the engineering loop, or not equipped with the right tools.
These traditional approaches result in organizations struggling to maintain resilient data pipelines and engineers being bogged down with one-off data fixes instead of building and refining infrastructure. The missing ingredient shared across all of these approaches is an intuitive yet powerful UI that anyone at the organization can use to fix bad data promptly and then keep it moving through the pipeline.
To create a truly robust data pipeline, organizations must find ways to effectively repair bad data by removing engineers from this process altogether.
This can be achieved by empowering non-technical team members across product, operations, and customer success to:
Review non-validating data in real-time: Providing a familiar workbook interface for non-technical team members to review and correct data issues streamlines the process and reduces the need for engineering intervention.
Recirculate good data back through the pipeline: Once data has been fixed, non-technical team members should be able to reintroduce it into the pipeline, increasing the throughput and velocity of high-quality data in the system.
Maintain validation logic directly, in some cases: Engineers frequently define validation rules directly in code with if/else conditions or with a popular library like Great Expectations, validate.js, or JSON Schema. However, when business users own the validation logic, when different variations of the logic must be applied depending on the data, or when the logic changes frequently, it is far preferable for non-technical stakeholders to write and maintain the validation rules themselves. That way, the organization can uphold high data quality standards without waiting for engineers to hardcode new logic.
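For comparison, here is a minimal sketch of what engineer-owned validation logic often looks like, using the jsonschema Python library; the field names, allowed plan names, and discount ceiling are hypothetical assumptions for illustration. Because the business rules live in code, changing any of them means a pull request and a deploy.

```python
from jsonschema import Draft7Validator  # pip install jsonschema

# Hypothetical rules hardcoded by engineers; changing the allowed plans
# or the discount ceiling requires a code change and a redeploy.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["customer_id", "plan", "discount_pct"],
    "properties": {
        "customer_id": {"type": "string", "minLength": 1},
        "plan": {"enum": ["starter", "growth", "enterprise"]},
        "discount_pct": {"type": "number", "minimum": 0, "maximum": 30},
    },
}

validator = Draft7Validator(ORDER_SCHEMA)

def validate_order(order: dict) -> list:
    """Return human-readable validation errors (empty list if the order is valid)."""
    return [error.message for error in validator.iter_errors(order)]

print(validate_order({"customer_id": "c-42", "plan": "premium", "discount_pct": 45}))
# e.g. ["'premium' is not one of ['starter', 'growth', 'enterprise']",
#       "45 is greater than the maximum of 30"]
```

Moving the ownership of rules like these out of the codebase and into a tool that business users can edit is what keeps the pipeline moving when the rules inevitably change.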
Several software companies are building tools to make this shift possible. Dromo, for example, focuses on empowering non-technical team members to define and maintain validation logic, review non-validating data in a familiar interface, and recirculate data back through the pipeline after it has been fixed.
By adopting such solutions, businesses can create more efficient and sustainable data pipelines that truly unlock the full potential of their data.
In summary, the key to a resilient data pipeline lies in empowering non-technical team members to maintain data quality and handle bad data in real-time.
By removing engineers from the process of fixing data issues, organizations can create a more sustainable system that allows engineers to focus on what they do best: building and refining the system itself. Many companies are making this approach more accessible, allowing businesses to unlock the full potential of their data and drive better decision-making.