In this first post in our 2-part ML Ops series, we are going to look at ML Ops and highlight how and why data quality is key to ML Ops workflows.

Machine learning has been, and will continue to be, one of the biggest topics in data for the foreseeable future. And while we in the data community are all still riding the high of discovering and tuning predictive algorithms that can tell us whether a picture shows a dog or a blueberry muffin, we're also beginning to realize that ML isn't just a magic wand you can wave at a pile of data to quickly get insightful, reliable results. Instead, we are starting to treat ML like other software engineering disciplines that require processes and tooling to ensure seamless workflows and reliable outputs. Data quality, in particular, has been a consistent focus, as it often leads to issues that can go unnoticed for a long time, bring entire pipelines to a halt, and erode the trust of stakeholders in the reliability of their analytical insights:

"Poor data quality is Enemy #1 to the widespread, profitable use of machine learning, and for this reason, the growth of machine learning increases the importance of data cleansing and preparation. The quality demands of machine learning are steep, and bad data can backfire twice — first when training predictive models and second in the new data used by that model to inform future decisions." (TDWI blog)

In this post, we are going to look at ML Ops, a recent development in ML that bridges the gap between ML and traditional software engineering, and highlight how data quality is key to ML Ops workflows in order to accelerate data teams and maintain trust in your data.

What is ML Ops?

Let's take a step back and first look at what we actually mean by "ML Ops". The term ML Ops evolved from the better-known concept of "DevOps", which generally refers to the set of tools and practices that combines software development and IT operations. The goal of DevOps is to accelerate software development and deployment throughout the entire development lifecycle while ensuring the quality of software, by streamlining and automating a lot of the steps required.

Some examples of DevOps practices most of us are familiar with are version control of code using tools such as git, code reviews, continuous integration (CI), i.e. the process of frequently merging code into a shared mainline, automated testing, and continuous deployment (CD), i.e. the frequent, automated deployment of code to production.

When applied to a machine learning context, the goals of ML Ops are very similar: to accelerate the development and production deployment of machine learning models while ensuring the quality of model outputs. However, unlike with software development, ML deals with both code and data:

Machine learning starts with data that's ingested from various sources, then cleaned, transformed, and stored using code. That data is made available to data scientists, who write code to engineer features and to develop, train, and test machine learning models, which, in turn, are eventually deployed to a production environment. In production, ML models exist as code that takes input data, which, again, may be ingested from various sources, and creates output data that feeds into products and business processes.

And while our description of this process is obviously simplified, it's clear to see that code and data are tightly coupled in a machine learning environment, and ML Ops needs to take care of both.

Concretely, this means that ML Ops incorporates tasks such as:

- Version control of any code used for data transformations and model definitions
- Automated testing of the ingested data and model code before going into production (see the sketch after this list)
- Deployment of the model in production in a stable and scalable environment
- Monitoring of the model performance and output
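To make the automated-testing item a bit more concrete, a team might run a small data check as part of its CI pipeline before new data or code is allowed into production. This is a minimal, hypothetical sketch in plain pandas and pytest; the file path, column names, and rules are invented for illustration and are not from the original pipeline described here.

```python
# test_ingested_data.py -- a hypothetical CI check on freshly ingested data.
# The path, columns, and rules below are illustrative only.
import pandas as pd


def test_ingested_orders_look_sane():
    df = pd.read_parquet("data/staging/orders.parquet")  # hypothetical location

    # The table should not silently arrive empty.
    assert len(df) > 0

    # Key identifiers must be present and unique.
    assert df["order_id"].notnull().all()
    assert df["order_id"].is_unique

    # A simple business rule: order totals are never negative.
    assert (df["order_total"] >= 0).all()
```

A check like this is deliberately simple; the point is that it runs automatically on every change, so a broken ingestion job fails a build instead of quietly corrupting everything downstream.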
How does data testing and documentation fit into ML Ops?

Let's go back to the original goal of ML Ops: to accelerate the development and production deployment of machine learning models while ensuring the quality of model outputs. Of course, as data quality folks, we at Great Expectations believe that data testing and documentation are absolutely essential to accomplishing those key goals of acceleration and quality at various stages in the ML workflow.

On the stakeholder side, poor data quality affects the trust stakeholders have in a system, which negatively impacts their ability to make decisions based on it. Or, even worse, data quality issues that go unnoticed might lead to incorrect conclusions and wasted time rectifying those problems.

On the engineering side, scrambling to fix data quality problems that were noticed by downstream consumers is one of the biggest drains on a team's time, and it slowly erodes productivity and morale.

Moreover, data documentation is essential for all stakeholders to communicate about the data and establish data contracts: "Here is what we know to be true about the data, and we want to ensure that continues to be the case."

In the following paragraphs, we'll look at the individual stages in an ML pipeline at a very abstract level and discuss how data testing and documentation fit into each stage.

At the data ingestion stage

Even at the earliest stages of working with a data set, establishing quality checks around your data and documenting them can immensely speed up operations in the long run. Solid data testing gives engineers confidence that they can safely make changes to ingestion pipelines without causing unwanted problems. At the same time, when ingesting data from internal and external upstream sources, data validation at the ingestion stage is absolutely critical to ensure that there are no unexpected changes to the data that go unnoticed. We've been trying really hard to avoid this cliché in this post, but here we go: garbage in, garbage out. Thoroughly testing your input data is absolutely fundamental to ensuring your model output isn't completely useless.

When developing a model

For the purpose of this article, we'll consider feature engineering, model training, and model testing to all be part of the core model development process. During this often iterative process, guardrails around the data transformation code and model output support data scientists so they can make changes in one place without potentially breaking things in others. In classic DevOps tradition, continuous testing via CI/CD workflows quickly surfaces any issues introduced by modifications to code. And to go even further, most software engineering teams require developers to not just test their code using existing tests, but also add new tests when creating new features. In the same way, we believe that running tests as well as writing new tests should be part of the ML model development process.
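As a hypothetical illustration of what "writing a new test alongside a new feature" can look like during model development, here is a small sketch: a data scientist adds a feature-engineering function and commits a unit test with it, so CI can catch regressions later. The function, columns, and values are invented for illustration.

```python
# features.py -- a hypothetical feature-engineering function added by a data scientist.
import pandas as pd


def add_days_since_signup(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Add a 'days_since_signup' feature derived from the 'signup_date' column."""
    out = df.copy()
    out["days_since_signup"] = (as_of - out["signup_date"]).dt.days
    return out


# test_features.py -- the test committed together with the new feature.
def test_days_since_signup_is_non_negative():
    df = pd.DataFrame({"signup_date": pd.to_datetime(["2021-01-01", "2021-03-15"])})
    result = add_days_since_signup(df, as_of=pd.Timestamp("2021-04-01"))

    # The feature should never be negative for signups in the past,
    # and a known example should produce a known value.
    assert (result["days_since_signup"] >= 0).all()
    assert result.loc[0, "days_since_signup"] == 90
```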
When running a model in production

As with all things ML Ops, a model running in production depends on both the code and the data it is fed in order to produce reliable results. Similar to the data ingestion stage, we need to secure the input data in order to avoid any unwanted issues stemming from either code changes or changes in the actual data. At the same time, we should also have some testing around the model output to ensure that it continues to meet our expectations. We occasionally hear from data teams that a faulty value in their model output had gone undetected for several weeks before anyone noticed (and in the worst case, they were alerted by their stakeholders before they detected the issue themselves).

Especially in an environment with black-box ML models, establishing and maintaining standards for output quality is crucial in order to trust the model output. In the same way, documenting the expected output of a model in a shared place can help data teams and stakeholders define and communicate "data contracts" in order to increase transparency and trust in ML pipelines.

What's next?

By this point, it's probably clear how data validation and documentation fit into ML Ops: namely, by allowing you to implement tests against both your data and your code at any stage in the ML Ops pipeline that we listed out above. We believe that data testing and documentation are going to become one of the key focus areas of ML Ops in the near future, with teams moving away from "homegrown" data testing solutions to off-the-shelf packages and platforms that provide sufficient expressivity and connectivity to meet their specific needs and environments. Great Expectations is one such data validation and documentation framework that lets users specify what they expect from their data in simple, declarative statements.
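To give a flavor of what those declarative statements can look like, here is a brief sketch using the classic pandas-backed Great Expectations API. Method names and result objects may differ between releases, and the file and column names here are invented for illustration.

```python
import great_expectations as ge

# Wrap a CSV of model output in a pandas-backed Great Expectations dataset
# (classic API; the path and column names are purely illustrative).
predictions = ge.read_csv("model_output/predictions.csv")

# Declarative statements about what we expect from the data.
predictions.expect_column_values_to_not_be_null("customer_id")
predictions.expect_column_values_to_be_between(
    "churn_probability", min_value=0, max_value=1
)

# Validate all expectations at once; the result reports overall success
# as well as per-expectation details.
results = predictions.validate()
print(results.success)
```

Previously published at https://greatexpectations.io/blog/ml-ops-great-expectations/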