The terms ‘MLOps’ and ‘AIOps’ are appearing more and more. Many from a traditional DevOps background might wonder why this isn’t just called ‘DevOps’. In this article we’ll explain why MLOps is so different from mainstream DevOps and see why it poses new challenges for the industry.
Current State of DevOps vs MLOps
DevOps is now a relatively well-established set of practices based around CI/CD and infrastructure. DevOps practitioners put tools and processes in place to realise faster time to value and greater governance for software development projects. The space of tools includes git, Jenkins, Jira, docker, kubernetes etc.
MLOps has not achieved the same level of maturity. As many as 87% of machine learning projects never go live. ML infrastructure is complex and workflows extend beyond production of artifacts to include data collection, prep and validation. The types of hardware resources involved can be specialised (e.g. GPUs) and require management. The data flowing through the model and the quality of predictions can also require monitoring, resulting in a complex MLOps landscape.
Why So Different?
The driver behind all these differences can be found in what machine learning is and how it is practised. Software performs actions in response to inputs and in this ML and mainstream programming are alike. But the way actions are codified differs greatly.
Traditional software codifies actions as explicit rules. The simplest programming examples tend to be ‘hello world’ programs that simply codify that a program should output ‘hello world’. Control structures can then be added to give more complex ways to perform actions in response to inputs. As we add more control structures, we learn more of the programming language. This rule-based input-output pattern is easy to understand in relation to older terminal systems where inputs are all via the keyboard and outputs are almost all text. But it is also true of most of the software we interact with, though the types of inputs and outputs can be very diverse and complex.
ML does not codify explicitly. Instead rules are indirectly set by capturing patterns from data. This makes ML more suitable for a more focused type of problem that can be treated numerically. One example is predicting salary from data points/features such as experience, education, location etc. This is a regression problem, where the aim is to predict the value of a variable (salary) from the values of other variables by use of previous data. Machine learning is also used for classification problems, where instead of predicting a value for a variable, the model outputs a probability that a data point falls into a particular class. Example classification problems are:
- Given hand-written samples for numbers, predict which number is which.
- Classify images of objects according to category e.g. types of flowers
We don’t need to understand all the details here of how ML is done. However, it will help to have a picture of how ML models are trained. So let’s consider at a high level what is involved in a regression problem such as predicting salary from data for experience, education, location etc. This can be addressed by programmatically drawing a line through the data points.
The line is embodied in an equation:
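A common way to write such a line for features $x_1, \dots, x_n$ is sketched here. The symbols are assumed for illustration: $\hat{y}$ is the predicted value (e.g. salary), $w_1, \dots, w_n$ are the coefficients/weights and $b$ is the intercept.

```latex
\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
```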
The coefficients/weights get set to initial values (e.g. at random). The equation can then be used on the training data set to make predictions. In the first run the predictions are likely to be poor. Exactly how poor can be measured in the error, which is the sum of the distances of all the output variable (e.g. salary) samples from the prediction line. We can then update the weights to try to reduce the error and repeat the process of making new predictions and updating the weights. This process is called 'fitting' or 'training' and the end result is a set of weights that can be used to make predictions.
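To make the loop concrete, here is a minimal sketch of fitting with a single feature. It is only an illustration: the data is made up, the error is measured as mean squared error (a common choice, rather than a plain sum of distances), and the weights are updated by gradient descent with an assumed learning rate and number of iterations.

```python
# Minimal sketch of "fitting": one feature, gradient descent on mean squared error.
# The data, learning rate and epoch count are made up for illustration.

def predict(w, b, x):
    """The prediction line: weight times feature plus intercept."""
    return w * x + b

def mean_squared_error(w, b, xs, ys):
    """Average squared distance between predictions and the known values."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, learning_rate=0.01, epochs=5000):
    w, b = 0.0, 0.0  # initial weights (these could equally be random)
    n = len(xs)
    for _ in range(epochs):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Nudge the weights in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

years_experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary_thousands = [35.0, 40.0, 45.0, 50.0, 55.0]  # constructed as 5 * years + 30

w, b = fit(years_experience, salary_thousands)
```

Because the toy data lies exactly on a line, the fitted `w` and `b` end up very close to 5 and 30. Real data is noisier, so the error never reaches zero and the question becomes how much error is tolerable.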
So the basic picture centres on running training iterations to update weights to progressively improve predictions. This helps to reveal how ML is different from traditional programming. The key points to take away from this from a DevOps perspective are:
- The training data and the code together drive fitting.
- The closest thing to an executable is a trained/weighted model. These vary by ML toolkit (tensorflow, scikit-learn, R, h2o, etc.) and model type.
- Retraining can be necessary. For example, your model might be making predictions for data that varies a lot by season, such as how many items of each type of clothing will sell in a month. In that case training on data from summer may give good predictions in summer but will not give good predictions in winter.
- Data volumes can be large and training can take a long time.
- The data scientist’s working process is exploratory and visualisations can be an important part of it.
This leads to different workflows for traditional programming and ML development.
With traditional programming a workflow might be as follows:
- User Story
- Write code
- Submit PR
- Tests run automatically
- Review and merge
- New version builds
- Built executable deployed to environment
- Further tests
- Promote to next environment
- More tests etc.
- Monitor - stacktraces or error codes
Typically the trigger for a build is a code change in git. The executable is normally packaged as a docker image.
With machine learning the driver for a build might be a code change. Or it might be new data. The data likely won’t be in git due to its size. Any tests are not likely to be a simple pass/fail since you’re looking for quantifiable performance. One might choose to express performance criteria numerically by tolerating a certain error level. What might be acceptable can vary a lot by business context. For example, consider a model that predicts a likelihood of a financial transaction being fraudulent. Then there may be little risk in predicting good transactions as fraudulent so long as the customer is not impacted directly (there may be a manual follow-up). But predicting bad transactions as good could be very high risk.
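As a rough sketch of what a numeric acceptance test for the fraud example might look like, the check below tolerates far more false alarms than missed fraud. The function names and the tolerance values are assumptions made for illustration; real thresholds would come from the business context.

```python
# Sketch of a numeric acceptance check with asymmetric tolerances.
# Labels: 1 = fraudulent transaction, 0 = good transaction.

def confusion_counts(y_true, y_pred):
    """Return (false_positives, false_negatives)."""
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_positives, false_negatives

def passes_acceptance(y_true, y_pred,
                      max_false_positive_rate=0.10,   # assumed tolerance
                      max_false_negative_rate=0.01):  # assumed tolerance
    """Good transactions flagged as fraud are tolerated more than fraud that slips through."""
    fp, fn = confusion_counts(y_true, y_pred)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fp_rate = fp / negatives if negatives else 0.0
    fn_rate = fn / positives if positives else 0.0
    return fp_rate <= max_false_positive_rate and fn_rate <= max_false_negative_rate
```

A model that raises the occasional false alarm passes this check, while one that lets a fraudulent transaction through on a small test set does not.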
The ML workflow can also differ depending on whether the model can learn while it is being used (online learning) or if the training takes place separately from making live predictions (offline learning). For simplicity let’s assume the training takes place separately. In that case a high-level workflow could look like:
- Data inputs and outputs. Preprocessed. Large.
- Data scientist tries stuff locally with a slice of data.
- Data scientist tries with more data as long-running experiments.
- Collaboration - often in jupyter notebooks & git
- Model may be pickled/serialized
- Integrate into a running app e.g. add REST API (serving)
- Integration test with app.
- Rollout and monitor performance metrics
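The "pickled/serialized" step in the list above can be sketched with Python's standard `pickle` module. To keep the example self-contained the "model" here is just a dictionary of hypothetical weights; in practice it would be a trained model object from an ML toolkit.

```python
import pickle

def predict(weights, years_experience):
    """Apply the (hypothetical) trained weights to one input."""
    return weights["weight"] * years_experience + weights["intercept"]

# Stand-in for the output of a training run.
trained_weights = {"weight": 5.0, "intercept": 30.0}

# Serialize to bytes; these could be written to a file or uploaded to a bucket.
serialized = pickle.dumps(trained_weights)

# A serving process would later load the bytes back into memory.
restored_weights = pickle.loads(serialized)
```

One design point worth noting: unpickling can execute arbitrary code, so pickles should only be loaded from sources you trust.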
As suggested already, monitoring performance metrics can be particularly challenging and may involve business decisions. For example, let’s say we have a model being used in an online store and we’ve produced a new version. In these cases it is common to check the performance of the new version by performing an A/B test. This means that a percentage of live traffic is given to the existing model (A) and a percentage to the new model (B). Let’s say that over the period of the A/B test we find that B leads to more conversions/purchases. But what if it also correlates with more negative reviews or more users leaving the site entirely or is just slower to respond to requests? A business decision may be needed.
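Here is a small sketch of how the numbers from such an A/B test might be compared. It uses a two-proportion z-test for the conversion rates, which is one common choice rather than the only one, plus a simple check on "guardrail" metrics such as latency. The metric names and the 10% tolerance are assumptions for illustration.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-score for the difference in conversion rate between B and A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err

def guardrails_ok(metrics_a, metrics_b, max_relative_increase=0.10):
    """For metrics where lower is better, report whether B stayed within tolerance of A."""
    return {
        name: metrics_b[name] <= metrics_a[name] * (1 + max_relative_increase)
        for name in metrics_a
    }

# Made-up results: B converts better but gets more negative reviews.
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
checks = guardrails_ok(
    {"latency_ms": 100.0, "negative_review_rate": 0.02},
    {"latency_ms": 105.0, "negative_review_rate": 0.03},
)
```

With these made-up numbers the conversion uplift looks statistically meaningful, yet one guardrail fails. The code can surface that tension but it cannot resolve it, which is where the business decision comes in.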
The role of MLOps is to support this whole flow of training, serving, rollout and monitoring. Let's better understand the differences from mainstream DevOps by looking at some MLOps practices and tools for each stage of this flow.
For many cases training jobs can be run on the data scientist’s local machine. However, as the size of the dataset or the amount of processing grows, local execution can become impractical. A tool is then needed that can leverage specialised cloud hardware, parallelize steps and allow long-running jobs to run unattended. There are a number of tools for this, for instance kubeflow pipelines.
Individual steps can be broken out as reusable operations and run in parallel if desired. This helps with steps that split the data into segments and apply cleaning and pre-processing to it. The UI allows for monitoring and inspection of the progress of steps. Runs can also be given different parameters and executed in parallel. This allows data scientists to experiment with different parameters and see which result in a better model. Similar functionality is provided by MLflow experiments, polyaxon and others.
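The idea of running the same training step with different parameters in parallel can be illustrated on a single machine. This sketch does not use the API of kubeflow pipelines or any other platform; it uses Python's standard `concurrent.futures` as a local stand-in, with a toy training function and made-up data.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up training data: salary (thousands) against years of experience.
XS = [1.0, 2.0, 3.0, 4.0, 5.0]
YS = [35.0, 40.0, 45.0, 50.0, 55.0]

def train_and_score(learning_rate, epochs=200):
    """One 'run': fit a line by gradient descent and report the final error."""
    w, b = 0.0, 0.0
    n = len(XS)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(XS, YS)]
        w -= learning_rate * 2 / n * sum(e * x for e, x in zip(errors, XS))
        b -= learning_rate * 2 / n * sum(errors)
    mse = sum((w * x + b - y) ** 2 for x, y in zip(XS, YS)) / n
    return {"learning_rate": learning_rate, "mse": mse}

def run_experiments(learning_rates):
    """Execute the runs in parallel and rank them, best (lowest error) first."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(train_and_score, learning_rates))
    return sorted(results, key=lambda r: r["mse"])

ranked = run_experiments([0.001, 0.01, 0.05])
```

A training platform adds what this sketch lacks: remote hardware, tracking of each run's parameters and results, and a UI for comparing them.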
Some training platforms can also be used for Continuous Integration. For example, a training run could be triggered on a commit to git and the model could be pushed from the job for it to be available to make live predictions. As noted before, deciding whether a model is good for live use can involve a complex mixture of factors. It might be that the main factors can be tested adequately at the training stage (e.g. model accuracy on test data). Or it might be that only initial checks are done at the training stage and the new version is only cautiously rolled out for live predictions. We’ll look at rollout and monitoring later - first we should understand what live predictions can mean.
Live Predictions and Model Serving
For some models, predictions are made in batches, for example on a file of data points or on a new file each week. This kind of scenario is offline prediction. In other cases predictions need to be made on demand. For live use-cases typically the model is made available to respond to HTTP requests. This is called serving.
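To give a feel for what serving means at its simplest, here is a bare-bones sketch using only Python's standard library. It is not how any particular serving product works; the weights, the JSON field names and the port are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained weights; a real server would load a serialized model.
WEIGHTS = {"weight": 5.0, "intercept": 30.0}

def predict_from_json(body: bytes) -> bytes:
    """Turn a JSON request body into a JSON prediction response."""
    payload = json.loads(body)
    years = float(payload["years_experience"])
    prediction = WEIGHTS["weight"] * years + WEIGHTS["intercept"]
    return json.dumps({"predicted_salary_thousands": prediction}).encode("utf-8")

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, run the model, and write the response.
        length = int(self.headers.get("Content-Length", 0))
        response = predict_from_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)

def serve(port=8000):
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts a local endpoint that accepts POST requests. Dedicated serving solutions layer model loading, scaling, routing and monitoring on top of this basic request-response idea.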
One approach to serving is to package a model by serializing it as a python pickle file and hosting that for the serving solution to load it. For example, this is a serving manifest for kubernetes using the Seldon serving solution (a tool on which I work):
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
The ‘SeldonDeployment’ is a kubernetes custom resource. Within that resource it needs to be specified which toolkit was used to build the model (here scikit-learn) and where to obtain the model (in this case a google storage bucket). Some serving solutions also cater for the model to be baked into a docker image but python pickles are common as a convenient option for data scientists. Submitting this resource to kubernetes will make an HTTP endpoint available that can be called to get predictions. Often the serving solution will automatically apply any routing/gateway configuration needed, so that data scientists don’t have to do so manually.
Self-service for data scientists can also be important for rollout. This can need careful handling because the model has been trained on a particular slice of data and that data might turn out to differ from live. The key strategies used to reduce the risk of this are:
1) Canary rollouts
With a canary rollout a percentage of the live traffic is routed to the new model while most of the traffic goes to the existing version. This is run for a short period of time as a check before switching all traffic to the new model.
2) A/B Test
With an A/B test the traffic is split between two versions of a model for a longer period of time. The test may run until a sufficient sample size is obtained to compare metrics for the two models. For some serving solutions (e.g. Seldon, KFServing) the traffic-splitting part of this can be handled by setting percentage values in the serving resource/descriptor. Again, this is to enable data scientists to set this without getting into the details of traffic-routing or having to make a request to DevOps.
3) Shadowing
With shadowing all traffic is sent to both existing and new versions of the model. Only the predictions of the existing/live version are returned as responses to live requests. The non-live model’s predictions are not returned and instead are just tracked to see how well it is performing.
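The routing logic behind these strategies can be sketched in a few lines. This is only an illustration of the idea, not the configuration or API of any serving solution: the function names are made up, and models are represented as plain Python callables.

```python
import random

def choose_version(new_model_percent, rng=random):
    """Canary or A/B split: send roughly `new_model_percent` percent of requests to the new model."""
    return "new" if rng.random() * 100 < new_model_percent else "existing"

def handle_with_shadow(request, existing_model, new_model, shadow_log):
    """Shadowing: both models see the request, but only the existing model's answer is returned."""
    live_prediction = existing_model(request)
    shadow_log.append({
        "request": request,
        "live": live_prediction,
        "shadow": new_model(request),  # tracked for comparison, never returned
    })
    return live_prediction
```

A canary and an A/B test can share the same splitting function. What differs is the percentage chosen, how long the split runs and what is done with the resulting metrics.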
Deciding between different versions of a model naturally requires monitoring.
With mainstream web apps it is common to monitor requests to pick up on any HTTP error codes or an increase in latency. With machine learning the monitoring can need to go much deeper into domain-specific metrics. For example, for a model making recommendations on a website it can be important to track metrics such as how often a customer makes a purchase vs chooses not to make a purchase or goes to another page vs leaves the site.
It can also be important to monitor the data points in the requests to see whether they are approximately in line with the data that the model was trained on. If a particular data point is radically different from any in the training set then the quality of prediction for that data point could be poor. Such a data point is termed an ‘outlier’, and in cases where poor predictions carry high risk it can be valuable to monitor for outliers. If a large number of data points differ radically from the training data then the model risks giving poor predictions across the board - this is termed ‘concept drift’ (a shift in the input data alone is also often called ‘data drift’). Monitoring for these is fairly advanced as the boundaries for outliers likely need to be set algorithmically by a data scientist.
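As a deliberately simple illustration, the sketch below flags a request as an outlier when any feature lies more than a chosen number of standard deviations from the training mean. Real outlier and drift detectors are usually more sophisticated; the threshold of 3 and the toy data are assumptions.

```python
import statistics

def fit_feature_stats(training_rows):
    """Compute (mean, standard deviation) for each feature column of the training data."""
    columns = list(zip(*training_rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def is_outlier(row, feature_stats, z_threshold=3.0):
    """True if any feature is more than `z_threshold` standard deviations from its training mean."""
    for value, (mean, std) in zip(row, feature_stats):
        if std > 0 and abs(value - mean) / std > z_threshold:
            return True
    return False

def outlier_fraction(rows, feature_stats, z_threshold=3.0):
    """Share of recent requests flagged as outliers; a rising value may hint at drift."""
    flagged = sum(is_outlier(row, feature_stats, z_threshold) for row in rows)
    return flagged / len(rows)

# Toy training data: (years of experience, age).
training = [(1.0, 30.0), (2.0, 32.0), (3.0, 35.0), (4.0, 31.0), (5.0, 33.0)]
stats = fit_feature_stats(training)
```

Tracking the outlier fraction over time, rather than only individual outliers, gives a crude signal of whether live data is moving away from the training data.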
For metrics that can be monitored in real-time it may be sufficient to expose dashboards with a tool such as grafana. However, sometimes the information that reveals whether a prediction was good or not is only available much later. For example, there may be a customer account opening process that flags a customer as risky. This could lead to a human investigation and only later will it be decided whether the customer was risky or not. For this reason it can be important to log the entire request and the prediction and also store the final decision. Then offline analysis run over a longer period can provide a wider view of how well the model is performing.
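The pattern of logging a prediction now and attaching the real outcome later can be sketched as follows. The in-memory dictionary stands in for whatever durable store would be used in practice, and the function and field names are made up for illustration.

```python
import uuid

def log_prediction(store, request, prediction, model_version):
    """Record a request and its prediction; the true outcome is not known yet."""
    record_id = str(uuid.uuid4())
    store[record_id] = {
        "request": request,
        "prediction": prediction,
        "model_version": model_version,
        "outcome": None,
    }
    return record_id

def record_outcome(store, record_id, outcome):
    """Attach the final decision once it becomes available, possibly much later."""
    store[record_id]["outcome"] = outcome

def accuracy_on_resolved(store):
    """Share of resolved records where the prediction matched the final decision."""
    resolved = [r for r in store.values() if r["outcome"] is not None]
    if not resolved:
        return None
    return sum(r["prediction"] == r["outcome"] for r in resolved) / len(resolved)
```

Keeping the model version on each record means the offline analysis can later be broken down per version.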
Support for custom metrics, request logging and advanced monitoring varies across serving solutions. In some cases a serving solution comes with out-of-the-box integrations (e.g. Seldon) and in other cases the necessary infrastructure may have to be set up and configured separately.
If something goes wrong with running software at a given point in time then we need to be able to recreate the circumstances of the failure. With mainstream applications this typically means tracking which code version was running in the form of an executable (docker image), which code commit it tracks back to and some information about the data state of the system at the time. That enables a developer to recreate that execution path in the source code. Taken to its fullest, the equivalent for machine learning would be much more extensive. It would involve knowing exactly what data was sent in (full request logging), which version of the model was running (not necessarily a docker image; likely a python pickle, but it could be one of various formats), what source code was used to build it, what parameters were set on the training run and what data was used for training. The data part can be particularly challenging, as it means retaining the data from every training run that goes live, in a form that can be used to recreate models. Any transformations on the data would therefore need to be tracked and reproducible.
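One way to picture what would need to be tracked is a lineage record stored alongside each trained model. The sketch below is an assumption about what such a record could contain, not the format of any particular tracking tool; it uses content hashes to tie a model to the exact training data and code commit.

```python
import hashlib
import json

def sha256_of(data: bytes) -> str:
    """Content fingerprint, so identical bytes always give the same identifier."""
    return hashlib.sha256(data).hexdigest()

def build_lineage_record(model_bytes, training_data_bytes, code_commit, hyperparameters):
    """Collect what is needed to trace a model back to how it was produced."""
    return {
        "model_sha256": sha256_of(model_bytes),
        "training_data_sha256": sha256_of(training_data_bytes),
        "code_commit": code_commit,          # e.g. a git commit hash
        "hyperparameters": hyperparameters,  # parameters set on the training run
    }

def matches_training_data(record, candidate_data_bytes):
    """Check whether some data is byte-for-byte what the model was trained on."""
    return record["training_data_sha256"] == sha256_of(candidate_data_bytes)

# Hypothetical values purely for illustration.
record = build_lineage_record(
    model_bytes=b"serialized-model",
    training_data_bytes=b"years,salary\n1,35\n2,40\n",
    code_commit="abc123",
    hyperparameters={"learning_rate": 0.01, "epochs": 5000},
)
serialized_record = json.dumps(record, sort_keys=True)
```

A hash only proves which data was used. Recreating the model still requires that the data itself, and any transformations applied to it, have been retained.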
The tool scene for tracking across the ML lifecycle is currently dynamic. There are tools such as ModelDB, kubeflow metadata, pachyderm and Data Version Control (DVC), among others. Part of the challenge is that as yet few standards have emerged as to what to track and how to track it. Naturally one would want smooth integrations between tracking tools and the other tools used across the cycle, especially model training platforms. Typically platforms currently just integrate with one chosen tool or leave it to the users of the platform to build any tracking they need into their own code.
There are also wider governance challenges for ML concerning bias and ethics. Without care models might end up being trained using data points that a human would consider unethical to use in decision-making. For instance, a loan approval system might be trained on historic loan repayment data. Without a conscious decision about which data points are to be used, it might end up making decisions based on race or gender.
Given concerns about bias, some organisations are putting an emphasis on being able to explain why a model made the prediction that it did in a given circumstance. To achieve this it is not only necessary to know the request and the version of the model and training data - being able to explain why a prediction was made can be a data science problem in itself. Some types of models such as neural networks are referred to as ‘black box’ as it is not easy to see why a prediction would come about from inspecting their internal structure. There are black-box explanation techniques emerging (such as Seldon's Alibi library) but for now many organisations for whom explainability is a key concern are sticking to white box modelling techniques.
MLOps is an emerging area. MLOps practices are distinct from mainstream DevOps because the ML development lifecycle and artifacts are different. There is a wide range of MLOps tools available but most are young and compared with mainstream DevOps the tools may not yet interoperate very well. There are some initiatives towards standardisation but currently the landscape is quite splintered with big commercial players (including major cloud providers) each focusing primarily on their own end-to-end ML platform offering. Large organisations are having to choose whether an end-to-end offering meets their machine learning platform needs or if they instead want to assemble a platform themselves from individual (likely open source) tools.