Enterprises rely more heavily on data to make decisions with each passing day. This is true regardless of industry: finance, media, retail, logistics, and so on. Yet the solutions that deliver data to dashboards and ML models continue to grow in complexity.
This is due to several reasons, including:

- The ability to process data from diverse sources at a low cost.
- An explosion in the availability and variety of data tools (impacting collaboration and decision making, beyond technical work).
- Tight dependencies among data assets managed by different departments within companies.
The need to run complex data pipelines with minimal error rates in these modern environments has led to the rise of a new role: the Data Reliability Engineer.
Data Reliability Engineering (DRE) addresses data quality and availability problems.
Comprising practices from data engineering to system operations, DRE is emerging as its own field within the broader data domain.
In this post, we’ll take a closer look at DRE: what it is, what the role involves, and how it relates to SRE and DevOps. Finally, we’ll touch on how to determine whether your company needs to hire a Data Reliability Engineer.
Let’s start with a basic definition.
In 2003, Google assembled its first Site Reliability Engineering (SRE) team. Its goal was to create scalable software that operates with minimal downtime and latency. Over the years, SRE has evolved into a must-have discipline for service reliability.
DRE brings that same philosophy to data workloads. Although many of the problems data engineers have long dealt with, such as infrastructure management and query performance, have largely been solved by cloud warehouses and modern extract-load-transform (ELT) solutions, handling data reliably remains non-trivial.
It calls for lineage tracking, change management, pipeline monitoring, and cross-team communication, among other practices. As a result, DRE often warrants specialized roles within an organization.
DRE focuses primarily on post-production observability. To achieve it, DRE teams instrument production environments with monitoring indicators and alerting mechanisms that fire when certain thresholds are breached. Data quality indicators such as freshness, volume, and completeness are commonly used.
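As a rough illustration, the sketch below shows how such indicators and thresholds might be declared and evaluated in code. The table names, limits, and measured values are hypothetical, and the alert hook is a stand-in for whatever paging or messaging tool a team actually uses.

```python
# Rough sketch: declaring data quality indicators as thresholded checks and
# raising alerts when a measurement breaches its threshold.
# Table names, limits, and the example measurements are illustrative only.
from dataclasses import dataclass

@dataclass
class IndicatorCheck:
    table: str
    indicator: str      # "freshness" (hours), "volume" (rows), "completeness" (null ratio)
    threshold: float
    breach_when: str    # "above" or "below"

CHECKS = [
    IndicatorCheck("user_events", "freshness", 24.0, "above"),
    IndicatorCheck("user_events", "volume", 10_000, "below"),
    IndicatorCheck("orders", "completeness", 0.01, "above"),
]

def alert(check: IndicatorCheck, value: float) -> None:
    # Placeholder: route to Slack, PagerDuty, SMS, etc.
    print(f"[ALERT] {check.table}.{check.indicator} = {value} breaches {check.threshold}")

def evaluate(measurements: dict[tuple[str, str], float]) -> None:
    # Compare each measured value against its declared threshold.
    for check in CHECKS:
        value = measurements[(check.table, check.indicator)]
        breached = (
            value > check.threshold
            if check.breach_when == "above"
            else value < check.threshold
        )
        if breached:
            alert(check, value)

# Example: measurements gathered by warehouse queries elsewhere in the pipeline.
evaluate({
    ("user_events", "freshness"): 30.0,    # hours since last load -> breach
    ("user_events", "volume"): 52_000,     # rows loaded today -> ok
    ("orders", "completeness"): 0.002,     # null ratio -> ok
})
```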
DRE is also a team sport: better results are achieved when everyone works toward the same goal. Data analysts and engineers who instrument the development environment with pre-production checks contribute substantially to successful DRE, since this reduces the number of errors introduced in the first place.
As an illustration, consider a pipeline that consolidates user events every week to refresh a machine learning model that generates recommendations. If the pipeline fails, the model is trained on stale data, and users may no longer be interested in the recommended items. In that scenario, the business outcome is declining conversion rates and rising churn.
How could DRE have helped? Data freshness is a relevant indicator for the ML model training process. It could be monitored daily by running a query that checks the last time user events were loaded into the training dataset. If the most recent load were, say, two days old, the DRE team would receive an SMS alert and could act in a timely fashion.
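A minimal sketch of that freshness check might look like the following. The user_events table, the loaded_at column, and the send_sms_alert() hook are hypothetical stand-ins, and sqlite3 stands in for your actual warehouse connector.

```python
# Minimal sketch of a daily freshness check for the ML training dataset.
# Assumes a hypothetical user_events table with an ISO-8601 UTC loaded_at column.
import sqlite3  # stand-in for your warehouse connector (Snowflake, BigQuery, ...)
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(days=2)

def send_sms_alert(message: str) -> None:
    # Placeholder: integrate with an SMS or paging provider of your choice.
    print(f"[SMS ALERT] {message}")

def check_training_data_freshness(conn) -> None:
    cur = conn.cursor()
    cur.execute("SELECT MAX(loaded_at) FROM user_events")
    last_loaded = cur.fetchone()[0]

    if last_loaded is None:
        send_sms_alert("Freshness check: user_events contains no data")
        return

    # Assumes loaded_at is stored as an ISO-8601 timestamp in UTC.
    last = datetime.fromisoformat(last_loaded).replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - last
    if age > FRESHNESS_THRESHOLD:
        send_sms_alert(
            f"Freshness check failed: user_events last loaded {age.days} days ago"
        )

if __name__ == "__main__":
    # Scheduled daily (cron, Airflow, etc.) so stale data is caught within a day.
    check_training_data_freshness(sqlite3.connect("warehouse.db"))
```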
Regarding pre-production checks, here's a relevant case study on improving data reliability.
DRE requires a specific set of skills spanning data engineering and system operations, but not exactly a developer’s mindset: the ability to monitor and optimize data pipelines in production is the must-have for Data Reliability Engineers.
Data Engineers are responsible for developing data pipelines and testing their code appropriately, whereas Data Reliability Engineers are responsible for supporting those pipelines in production by monitoring the infrastructure and the data quality. In other words, data engineering teams usually run unit and regression tests that address known or predictable data issues before the code reaches production, while DRE teams instrument the production environment to detect unknown problems before they impact end users.
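To make the contrast concrete, here is a minimal sketch of the kind of pre-production check a data engineering team might write for a known, predictable issue; the deduplication logic and test data are hypothetical.

```python
# Sketch of a pre-production (unit/regression) test for a known, predictable
# data issue: duplicate events produced by at-least-once delivery.
# The transformation and test data are hypothetical examples.

def deduplicate_events(events: list[dict]) -> list[dict]:
    """Keep only the latest record per event_id."""
    latest: dict[str, dict] = {}
    for event in events:
        current = latest.get(event["event_id"])
        if current is None or event["updated_at"] > current["updated_at"]:
            latest[event["event_id"]] = event
    return list(latest.values())

def test_deduplicate_events_keeps_latest_record():
    events = [
        {"event_id": "a", "updated_at": "2023-01-01", "status": "pending"},
        {"event_id": "a", "updated_at": "2023-01-02", "status": "complete"},
        {"event_id": "b", "updated_at": "2023-01-01", "status": "pending"},
    ]
    result = {e["event_id"]: e for e in deduplicate_events(events)}
    assert len(result) == 2
    assert result["a"]["status"] == "complete"  # latest record wins
```

A test like this, run in CI before deployment, catches a known failure mode; the production monitoring described above is what catches the unknown ones.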
Data Reliability Engineers work with other teams, taking care of databases, data pipelines, deployments, and the availability of these systems, and they are expected to understand each of these areas.
In the past decade, software development teams have widely adopted containerization and continuous delivery practices to ship new features and bug fixes faster. Data Reliability Engineers advocate for such practices within data teams.
Finally, Data Reliability Engineers act as trusted advisors for the company, actively participating in data platform infrastructure design and scalability considerations.
SRE and DevOps are both mature areas of the software development landscape that inspired DRE. In some cases, SRE and DevOps have become the responsibility of the software engineer; in other cases, they have spawned dedicated roles within a company. DRE is very likely to be handled in the same way. The number of pipelines to support, along with their complexity and availability requirements, is a key element to consider when determining what it takes to implement DRE successfully within a company.
The similarities between DRE, SRE, and DevOps don’t just stop there.
SRE takes the approach that operations should be treated as a software problem. For example, SRE solutions should be developed in code, stored and managed in version control, and expected to evolve as new requirements arise. The same approach applies to DRE. One key benefit is failure recovery: the Git repository contains a detailed changelog of the DRE solution, making it easier to troubleshoot and to restore previous versions when a change introduces unintended behavior.
DRE shares an additional essential aspect with SRE and DevOps: team culture. In DRE, SRE, and DevOps, all team members are responsible for the success or failure of the team and its solutions. As mentioned before, Data Reliability Engineers are trusted advisors who advocate for best reliability practices inside the company. Still, the results will depend on how their teammates (managers, data engineers, infrastructure analysts, and so on) adhere to such practices.
When it comes to soft skills, DRE, SRE, and DevOps engineers must have excellent verbal and written communication skills, as their work intersects with many different roles within the company.
Given the topics discussed so far, you might wonder at what stage of the data journey companies should invest in DRE. We recommend that companies begin to invest in DRE when any of the following become true:
As with any investment, companies should be able to measure the ROI of DRE before deciding to invest more (or less) in it. To do so, data teams are expected to define KPIs and set SLAs with the DRE team. A baseline is recorded when the discussion starts, and periodic plan-do-check-act (PDCA) meetings then take place to assess impact and define subsequent actions.
Clear goals, supported by KPIs and SLAs, will also help Data Reliability Engineers to better design their working plans. It is not worth having seasoned engineers—usually with more than five years of experience in developing, monitoring, and optimizing data solutions—if the company cannot guide them on how to show their value.
Now that you know what DRE is, maybe it's time for your company to dive more deeply into the topic. Or maybe you're looking to level up your skills and seek out work as a DRE. One company out there—Datafold—has a set of core products that help put DRE into operation:
Datafold's column-level lineage tool provides plug-and-play, column-level lineage for reliable impact analysis. No developer resources are needed: simply connect your data warehouse, and you can explore the lineage graph. The SOC 2 compliant tool analyzes every SQL statement in your data warehouse and displays the graph of dependencies in an intuitive UI.
Data Diff provides easy, one-click regression testing for ETL, integrating into the CI process and supporting both GitHub and GitLab. It validates every source code change so that you can easily see how changes to your code impact the data produced, across all rows and columns.
Data Monitoring allows DRE teams to achieve faster incident response times with code-free alerting in just one click. It's powered by an ML model that learns the seasonality and trends in your data to create dynamic thresholds.
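This post doesn't go into how that model works, but the general idea behind dynamic thresholds can be illustrated with a toy sketch (this is not Datafold's implementation): rather than a fixed, hand-picked limit, the expected range for a metric is derived from its own recent history.

```python
# Toy illustration of dynamic thresholds (not Datafold's implementation):
# the acceptable range for today's row count is derived from the recent
# history of the same metric instead of a fixed, hand-picked limit.
from statistics import mean, stdev

def dynamic_bounds(history: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) bounds as the mean +/- N standard deviations."""
    mu, sd = mean(history), stdev(history)
    return mu - sigmas * sd, mu + sigmas * sd

# Daily row counts for the past two weeks (hypothetical numbers).
history = [51_200, 49_800, 50_500, 52_100, 48_900, 50_300, 51_700,
           49_500, 50_900, 51_300, 48_700, 50_100, 52_400, 49_900]

lower, upper = dynamic_bounds(history)
today = 31_000  # today's observed row count

if not (lower <= today <= upper):
    print(f"[ALERT] row count {today} outside expected range ({lower:.0f}, {upper:.0f})")
```

A production-grade model would also account for weekly seasonality and long-term trend, which is where ML-based approaches earn their keep.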
Regardless of the next steps for you or your company, it's clear that the role of the DRE is here to stay. The complexity of tomorrow's data needs is only further compounded by the demand for data quality and availability. To meet this challenge head-on, it's time to embrace and nurture the role of the Data Reliability Engineer.