Hi Hackernoon! In this article, I will discuss the concept of MLOps in detail. Moreover, I will do it in 3 ways. First, theoretically - through the most sensible MLOps scheme. Then, conceptually, through the artifacts that are embedded in the approach. And finally, through understanding MLOps as an information system.
So, let’s start.
The question of what MLOps is has long occupied the minds of many ML system development teams. In this article, by such a system I mean an information system, one or more components of which contain a trained model that performs some part of the overall business logic.
Like any other component of the system, this part of the business logic needs to be updated to meet changing business and customer expectations. MLOps is all about this regular update.
There is still no single, agreed-upon definition of MLOps. Many authors have tried to give one, but a description that is both clear and systematic is hard to find.
There is one that could be considered such:
MLOps is an engineering discipline that aims to unify ML systems development (dev) and ML systems deployment (ops) to standardize and streamline the continuous delivery of high-performing models in production.
Let's highlight the critical parts of the MLOps definition:
So, MLOps is a kind of DevOps for ML models.
It is easy to believe that such an engineering discipline would originate in a large IT company. So we can trust the theory that MLOps, as a meaningful approach, originated at Google, where D. Sculley was trying to save his nerves and time from the mundane tasks of pushing ML models to production. D. Sculley is now widely known as "The Godfather of MLOps" - the podcast of the same name is easy to find online.
D. Sculley began to consider the work with models from the point of view of the team's technical debt. Yes, it is possible to release new versions of models quickly, but the cost of supporting the resulting system will have a significant impact on the company's budget.
His work resulted in the now-famous paper "Hidden Technical Debt in Machine Learning Systems," published in 2015.
Like most IT processes, MLOps has maturity levels. They help companies understand where they are now and how they can move to the next level (if there is such a goal). Also, generally accepted maturity assessment methods let you see where you stand relative to competitors.
The most thoroughly described and largely understandable model comes from the analytics firm GigaOm. Of all the models, it is the closest to the Capability Maturity Model Integration (CMMI) - a set of methodologies for improving organizational processes that is likewise built around five maturity levels. The GigaOm model also has five levels, from 0 to 4.
The model from GigaOm unpacks each maturity level through 5 categories: strategy, architecture, modeling, processes, and governance.
Guided by this model in the early stages of building an ML system, you can think ahead about essential aspects and reduce the chances of failure. Moving from one maturity level to a higher one presents the team with new challenges that they may not have realized existed before.
It's worth noting that Google also has its own MLOps maturity model. It was one of the first to appear. It is concise and consists of 3 levels:
It's hard to escape the thought that this model resembles the famous instructions for drawing an owl: first, do everything by hand, then assemble the ML pipelines, and finally, arrive at full MLOps. That said, it is well described.
Today, many large companies using ML have compiled their maturity models.
All the highlighted models converge on roughly the same thing. At the zero level, they have the absence of any ML practices; at the last level, the automation of MLOps processes. The middle always differs in something related to incremental process automation. In Azure's model, for example, this means automating the training process first and model deployment after that.
How do you describe all the processes associated with the concept of MLOps? Surprisingly, three German researchers - the authors of the paper "Machine Learning Operations (MLOps): Overview, Definition, and Architecture" - managed to capture them in one comprehensive diagram.
It can be intimidating as it has many elements interacting with each other. However, many of the characteristics of the maturity levels mentioned above can be found in it. At least automated Pipelines, CI/CD, Monitoring, Model Registry, Workflow Orchestration, and Serving Component.
Let's discuss this diagram and talk about each one in more detail.
The main parts of the scheme are horizontal blocks, within which the procedural aspects of MLOps are described (they are assigned letters A, B, C, and D). Each of them is designed to solve specific tasks within the framework of ensuring the smooth operation of ML services of the company. For ease of understanding the scheme, it is better to start out of order.
If a company has ML services, employees work in Jupyter. In many companies, this is the tool where all ML development processes are centered. This is where most of the tasks that require implementing MLOps practices start.
Let's consider an example. Company A needs to automate a part of some processes using machine learning (let's assume that there is a corresponding department and specialists). It is unlikely that the way to solve the task is known in advance. Therefore, the executors need to study the problem statement and test possible ways of its realization.
To do this, an ML engineer/ML developer writes code for various task implementations and evaluates them against the target metrics. All of this is almost always done in Jupyter Lab. In this format, a lot of important information has to be captured manually, and the implementations then have to be compared with each other.
Such activity is called experimentation. It means obtaining a working ML model, which can be further used to solve relevant problems.
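To make this concrete, here is a minimal, notebook-style sketch of such an experiment: two candidate implementations are trained on the same data and compared by a target metric. The dataset, the models, and the metric are placeholders - your task will dictate its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# A synthetic dataset stands in for the company's real data
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Train every candidate and record the target metric for comparison
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = f1_score(y_test, model.predict(X_test))

print(results)  # the exact numbers will vary from run to run
```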
Block C shown in the diagram describes the process of conducting ML experiments.
Many decisions in ML development are made based on analyzing the data available in the company. It is not possible to train a model with target quality metrics on low-quality data or data that does not exist.
Therefore, it is important to figure out what data we have and what we can do with it. To do this, for example, we can:
A better understanding of the data can only be obtained when coupled with semantic and structural analysis.
The data, however, is not always under the project team's control. In that case, additional difficulties are guaranteed. Sometimes at this stage it becomes clear that there is no point in developing the project further, because the data is simply not suitable for the job.
When there is confidence in the available data, it is necessary to think about the preprocessing rules. Even if there is a large set of suitable data, there is no guarantee that it contains no gaps, distorted values, and so on. This is where the term "input data quality" and the well-known phrase "Garbage in - garbage out" come in.
No matter how good a model is used, it will produce poor results on low-quality data. In practice, many project resources are spent on creating a high-quality dataset.
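As an illustration, here is a minimal pandas sketch of typical cleaning steps; the file name and column names are hypothetical and stand in for whatever your source data looks like.

```python
import pandas as pd

# Hypothetical raw export from a source system
df = pd.read_csv("raw_orders.csv")

# Remove exact duplicates produced by repeated unloads
df = df.drop_duplicates()

# Rows without key identifiers are useless for training
df = df.dropna(subset=["user_id", "order_id"])

# Fix types and obviously distorted values
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = df["amount"].clip(lower=0)          # negative amounts are data-entry noise
df["category"] = df["category"].fillna("unknown")  # keep the row, mark the gap explicitly

df.to_parquet("orders_clean.parquet", index=False)
```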
Once the data has been prepared, it makes sense to think about the metrics of the model being trained and how experiments are conducted. Within the block under consideration, an experiment ties together the training and validation of the ML model.
The experiment follows the classical scheme: training the desired version of the model with the selected set of hyperparameters on the prepared dataset. For this purpose, the dataset itself is divided into training, test, and validation samples:
You can read more about validation samples in
If the trained model's metrics are good, the model code and the selected parameters are stored in a corporate repository.
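A minimal sketch of one such experiment is shown below: the prepared dataset is split into training, validation, and test samples, a model version is trained with a chosen set of hyperparameters, and the parameters and metrics are logged to a tracking tool. MLflow here is my assumption - any experiment tracker plays the same role - and the experiment name is hypothetical.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)

# 60% train / 20% validation / 20% test
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

params = {"n_estimators": 300, "max_depth": 8}

mlflow.set_experiment("churn-prediction")  # hypothetical experiment name
with mlflow.start_run():
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    mlflow.log_params(params)
    mlflow.log_metric("f1_val", f1_score(y_val, model.predict(X_val)))
    mlflow.log_metric("f1_test", f1_score(y_test, model.predict(X_test)))

    # If the metrics are good, the model and its parameters end up in the tracker/registry
    mlflow.sklearn.log_model(model, "model")
```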
The fundamental goal of the experimentation process is model engineering, which boils down to selecting the best algorithm and the best set of hyperparameters.
The difficulty with experiments is that the developer needs to check many combinations of model parameters - and that is before even considering different variants of the underlying mathematical apparatus.
In general, it takes time and effort. And what should you do if no combination of model parameters achieves the desired metrics?
If the desired metrics of ML-model operation cannot be achieved, you can try to extend the feature description of dataset objects with new features. Due to them, the context for the model will expand, and therefore, the desired metrics may improve.
New features may include:
Let's look at the part of the diagram that relates to Feature Engineering.
Block B1 aims to form a set of requirements for transforming the available source data and obtaining features from it. The features themselves are then computed from this preprocessed and cleaned data according to the "formulas" specified by the ML developer.
It is essential to say that working with features is iterative. While applying one set of features, a new idea may come to mind; it is realized as another set of features, and so on, ad infinitum. This is explicitly shown in the diagram as a Feedback Loop.
Block B2 describes the immediate process of adding new features to the data.
Connecting to and retrieving data sources are technical operations that can be pretty complicated. For simplicity of explanation, I will assume that there are several sources to which the team has access and tools to unload data from these sources (at least Python scripts).
Data cleaning and transformation. This stage largely mirrors the corresponding step in the experimentation block (C) - data preparation. After the first experiments, there is already an understanding of what data is needed, and in what format, for training ML models. All that remains is to generate and test new features correctly, but the data preparation process for this purpose should be automated as much as possible.
Computation of new features. As noted above, these actions can consist of simply transforming a few elements of a data tuple. Another option is to run a separate large processing pipeline to add a single feature to the same data tuple. Either way, there is a set of actions that are sequentially executed.
Adding the result. The result of the previous actions is added to the dataset. Most often, features are added to the dataset in batch to reduce the database load. But sometimes it is necessary to do it on the fly (streaming) to speed up the execution of business tasks.
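A schematic example of such a step: computing a few aggregate features from the cleaned data and adding them to the dataset in one batch. The column names are hypothetical and continue the orders example from above.

```python
import pandas as pd

orders = pd.read_parquet("orders_clean.parquet")

# "Formulas" defined by the ML developer: per-user aggregates over the purchase history
features = orders.groupby("user_id").agg(
    user_total_spend=("amount", "sum"),
    user_order_count=("order_id", "count"),
    user_avg_check=("amount", "mean"),
).reset_index()

# Batch addition of the result: every order row is enriched with its user's features
dataset = orders.merge(features, on="user_id", how="left")
dataset.to_parquet("training_dataset.parquet", index=False)
```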
It is essential to use the obtained features as efficiently as possible: save them and reuse them in the tasks of the company's other ML developers. The scheme has a Feature store for this purpose. It should be available across the company, administered separately, and version all the features that get into it. The Feature store has 2 parts: online (for streaming scenarios) and offline (for batch scenarios).
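Production teams usually take a ready-made feature store (Feast, Hopsworks, or a cloud vendor's offering), but the idea can be shown with a deliberately naive toy sketch: an offline part backed by files for training, and an online part backed by a key-value lookup for low-latency serving. Everything here is illustrative, not a real tool's API.

```python
import pandas as pd


class ToyFeatureStore:
    """A toy illustration only: real feature stores add versioning,
    access control, point-in-time correctness, and much more."""

    def __init__(self):
        self._online = {}  # feature_name -> {entity_id: value}, a stand-in for Redis etc.

    def register(self, name: str, df: pd.DataFrame, entity_col: str, value_col: str):
        # Offline part: persisted for training and batch scoring
        df[[entity_col, value_col]].to_parquet(f"features_{name}.parquet", index=False)
        # Online part: materialized for streaming / low-latency inference
        self._online[name] = dict(zip(df[entity_col], df[value_col]))

    def get_online(self, name: str, entity_id):
        return self._online[name].get(entity_id)


store = ToyFeatureStore()
features = pd.read_parquet("training_dataset.parquet")
store.register("user_avg_check", features, "user_id", "user_avg_check")
print(store.get_online("user_avg_check", 42))  # value for user 42, or None if unseen
```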
At the beginning of the article, I indicated that by ML system I mean an information system, one or more components of which contain a trained model that performs some part of the overall business logic. The better the ML model obtained due to development, the greater the effect of its operation. A trained model processes the incoming stream of requests and provides some predictions in response, thus automating some parts of the analysis or decision-making process.
The process of using a model to generate predictions is called inference, and teaching a model is called training. A clear explanation of the difference between the two can be borrowed from Gartner. Here, I will practice on cats.
For the efficient operation of a production ML system, it is vital to keep an eye on the model's inference metrics. As soon as they start dropping, the model should be either retrained or replaced with a new one. Most often, this happens because of changes in the input data (data drift). For example, imagine a business problem where the model has to recognize cupcakes in photos, and this is what it receives as input. The Chihuahua dogs in the example are there for balance:
A model trained on the original dataset knows nothing about Chihuahuas, so it predicts incorrectly. The dataset therefore has to be changed and new experiments conducted. The new model should be in production as soon as possible. No one forbids users from uploading Chihuahua pictures, but they will get wrong results in the meantime.
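Drift like this can be caught numerically before anyone eyeballs the pictures. One common rule of thumb is the Population Stability Index (PSI) over a feature or prediction distribution; below is a small sketch with the usual thresholds (under 0.1 - no meaningful shift, over 0.25 - significant drift). The distributions are synthetic placeholders.

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between the training-time and current distributions."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero / log of zero for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)  # what the model saw at training time
live_scores = rng.normal(0.7, 1.2, 10_000)   # what it sees in production now

print(f"PSI = {psi(train_scores, live_scores):.3f}")  # well above 0.25 here, i.e. time to retrain
```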
Now to more real-world examples. Let's consider the development of a recommendation system for a marketplace.
Based on the user's purchase history, purchases of similar users, and other parameters, a model or ensemble of models generates a block with recommendations. It contains products whose purchase revenue is regularly counted and tracked.
Something happens, and customers' needs change. Consequently, their recommendations are no longer relevant. Demand for the recommended products drops. All this leads to a decrease in revenue.
Next, managers scream and demand that everything be restored by tomorrow, which leads nowhere. Why? There is not enough data on the new customer preferences yet, so you cannot even build a new model. You can take some basic recommendation algorithms (item-based collaborative filtering) and push them to production. That way, recommendations will work somehow, but it is only a temporary workaround.
Ideally, the process should be set up so that retraining or experimentation with different models is triggered by metrics, without managers demanding it. And the best model would eventually replace the current one in production. In the diagram, this is the Automated ML Workflow Pipeline (block D), which is started by triggers in some orchestration tool.
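A sketch of what such a trigger might look like in an orchestrator. Airflow is my assumption here, the metric check and the retraining body are placeholders, and the ShortCircuitOperator simply skips retraining while the production metric is still healthy.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def production_metric_degraded() -> bool:
    # Placeholder: fetch the live metric from your monitoring system
    current_f1 = 0.78
    return current_f1 < 0.85  # True -> continue the pipeline, False -> skip retraining


def retrain_model():
    # Placeholder: data preparation, training, validation, pushing to the model registry
    print("Retraining pipeline started")


with DAG(
    dag_id="automated_ml_workflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # or an external trigger fired by the monitoring system
    catchup=False,
) as dag:
    check = ShortCircuitOperator(task_id="check_metrics", python_callable=production_metric_degraded)
    retrain = PythonOperator(task_id="retrain_model", python_callable=retrain_model)
    check >> retrain
```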
This is the most heavily loaded section of the scheme. Several key third-party components are involved in the operation of block D:
The structure of the block itself combines the stages of the experimentation (C) and feature development (B2) blocks. This is not surprising, considering that these are exactly the processes that need to be automated. The main differences are in the last 2 stages:
The remaining steps are identical to those described above.
Separately, I want to mention service artifacts that are required by the orchestrator to run model retraining pipelines. This is the code that is stored in the repository and runs on selected servers. It is versioned and upgraded following all the rules of software development. This code implements the model retraining pipelines, and the result depends on its correctness.
More often than not, various ML tools are run within the code, within which the execution of the steps of the pipelines takes place, for example:
It is worth noting here that it is generally impossible to fully automate experimentation. It is possible, of course, to add the AutoML concept to the process. However, there is currently no recognized solution that delivers equally good results for any subject of experimentation.
In the general case, AutoML works like this:
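Roughly speaking, it enumerates candidate models and hyperparameter combinations, trains and evaluates each one against the target metric, and keeps the best configuration found within the allotted budget. A minimal sketch of that search loop follows; Optuna is my choice of library here, not a prescription, and the dataset is synthetic.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)


def objective(trial: optuna.Trial) -> float:
    # Each trial is one candidate configuration
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=3, scoring="f1").mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, round(study.best_value, 3))
```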
That covers automation, more or less. Next, we need to organize the delivery of a new version of the model to production.
An ML model is needed to generate predictions. But the model itself is just a file, and a file cannot generate anything on its own. You can often find this solution on the Internet: the team takes FastAPI and writes a Python wrapper around the model so that clients can "go and get predictions."
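Such a wrapper might look something like the sketch below - assuming a scikit-learn model saved with joblib and a single numeric feature vector per request; the file name and route are, of course, placeholders.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # the file produced by the training pipeline


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}

# Run with: uvicorn app:app --port 8000
```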
From the moment the ML model file is received, there are several ways things can unfold. The team can go:
It is a labor-intensive task to do this for all models and to maintain the entire code base in the future. To make it easier, special serving tools have appeared, which introduced 3 new entities:
An Inference Instance, or Inference Service, is a specific ML model prepared to receive queries and generate response predictions. Such an entity can be, for example, a pod in a Kubernetes cluster with a container holding the necessary ML model and the technical tooling to run it.
The Inference Server is the creator of Inference Instances/Services. There are many implementations of Inference Server. Each can work with specific ML frameworks, converting the models trained in them into ready-to-use models for processing input queries and generating predictions.
The Serving Engine performs the primary management functions. It determines which Inference Server will be used, how many copies of the resulting Inference Instance should be started, and how to scale them.
In the scheme under consideration, there is no such detailing of model serving components, but similar aspects are outlined:
For an example of a complete stack for Serving, we can refer to the stack from Seldon:
There is even a standardized protocol for implementing Serving, support for which is de facto mandatory in all such tools. It is called V2 Inference Protocol and has been developed by several prominent market players - KServe, Seldon, and Nvidia Triton.
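To get a feel for it, this is roughly what a V2 Inference Protocol request looks like from the client side. The host, port, and model name are placeholders; check your serving tool's documentation for the exact tensor names and layout it expects.

```python
import requests

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2],
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/my-model/infer",  # placeholder endpoint
    json=payload,
    timeout=5,
)
print(response.json())  # a JSON body with an "outputs" list containing the predictions
```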
In various articles, you may find references to the tools for Serving and Deploying as a single entity. However, it is essential to understand the difference in the purpose of both. This is a debatable issue, but this article will put it this way:
Many strategies can be used to Deploy models, but these are not ML-specific. By the way, the paid version of Seldon supports several of the strategies, so you can just select this stack and enjoy how everything works.
Remember that model performance metrics must be tracked. Otherwise, you won't be able to solve problems in time. How exactly to track metrics is the big question. Arize AI has built a whole business on this, but good old Grafana and VictoriaMetrics still do the job.
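As a lightweight starting point, the model service can simply expose its own metrics for Prometheus or VictoriaMetrics to scrape and Grafana to draw. Here is a sketch using the prometheus_client library, with the model call stubbed out; the metric names and port are my own choices.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS_TOTAL = Counter("model_predictions_total", "Number of predictions served")
INFERENCE_LATENCY = Histogram("model_inference_latency_seconds", "Inference latency, seconds")


def predict(features):
    with INFERENCE_LATENCY.time():
        PREDICTIONS_TOTAL.inc()
        return random.random()  # stand-in for the real model call


if __name__ == "__main__":
    start_http_server(9100)  # metrics become available at http://localhost:9100/metrics
    while True:
        predict([1.0, 2.0, 3.0])
        time.sleep(1)
```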
Given everything written above, it turns out that an ML team:
It looks costly and not always justified. That is why there is a separate MLOps Project Initiation (A) block in the diagram, responsible for rational goal setting.
An example of the IT director's reasoning can demonstrate the way of thinking here. An inspired project manager comes to him and asks for a new platform installation for building an ML system. If both are acting in the best interest of the company, clarifying questions will follow from the IT director:
The IT director will be resented like a strict university examiner, but he will save the company money. If all the questions have answers, there is a real need for an ML system.
It all depends on the problem. If you need a one-time solution - for example, a PoC (Proof of Concept) - you don't need MLOps. If it is essential to process many incoming requests, then MLOps is required. In essence, the approach is similar to optimizing any corporate process.
To justify the need for MLOps to management, you need to prepare answers to the questions:
The next thing to do is to retake the IT director's exam.
The challenges continue because the team must also be convinced of the need to change their work processes and technology stack. Sometimes, this is more difficult than asking management for a budget.
To convince the team, it is worth preparing answers to the questions:
As you can see, this process isn’t simple.
I'm done here with a detailed study of the MLOps scheme. However, these are only theoretical aspects. Practical implementation always reveals additional details that can change a lot of things.
In the next article, I will discuss:
Thank you for your attention!