Finally! It is time for the 2nd part of my detailed guide on MLOps! Congratulations!
If you regularly read articles on MLOps, you begin to form a particular perception of the context. Thus, the authors of the texts mainly write about working with 3 types of artifacts:
This explains the essence of MLOps. ML-team should create a code base, due to which an automated and repeatable process will be realized:
Now, let`s discuss this in more detail.
If you have read my previous article, you have already seen this MLOps diagram. Here we can find the following data sources:
It is highly debatable to call this list sources, but the conceptual intent seems clear. Some data is stored in a large number of systems and processed in different ways. All of them may be required for an ML model.
How to make sure the correct data gets into the ML system:
Eventually, the company can have a full-fledged Data Platform with ETL/ELT, data buses, object stores, and other stuff.
The critical aspect of using data within MLOps is automating the preparation of high-quality datasets for training ML models.
Now, let's look for artifacts that are relevant to ML models:
Similar to data, you need tools to help:
A key aspect of working with models within MLOps is automating the process of retraining models to achieve better quality metrics of their performance on customer queries.
Code is the easiest. It just automates the processes of working with data and models.
We can find the following artifacts relevant to the code:
You can only add infrastructure as a code (IaaC), with which all the necessary infrastructure is lifted.
It should be noted that sometimes there is additional code for orchestration, mainly if multiple orchestrators are used in a team. For example, Airflow, which runs the DAG in Dagster.
In this image, I see several types of computing infrastructure in use:
Moreover, the model serving computational infrastructure is used both for conducting experiments and retraining models within the automated pipelines' framework. This approach is possible if the utilization of computational infrastructure has a reserve for the simultaneous execution of these processes.
In the first stages, all tasks can be solved within one infrastructure, but the need for new capacities will grow in the future. In particular, due to the specific requirements for configurations of computing resources:
The ubiquity of Kubernetes as an infrastructure platform for ML systems adds to the complexity. All computational resources need to be able to be utilized in k8s.
The big scheme of MLOps is just the top level of abstraction that we have to deal with.
After looking at such an extensive MLOps scheme and the artifacts mentioned above, there is no desire to build something similar in your own company. You have to select and implement many tools, prepare the necessary infrastructure for them, teach the team how to work with all of this and maintain all of the above.
The most essential thing in MLOps is to get started. You should only implement some MLOps components at a time if there is no business need for them. Guided by maturity models, you can create a base around which the ML platform will be developed in the future.
It is likely that many components will never be needed to achieve business goals. This idea is already being actively promoted in various articles about reasonable-scale MLOps and medium-scale MLOps.
Let's mention ModelOps. Is it part of MLOps or is it MLOps part of ModelOps? Here's an article that does a great job of answering that question.
Viewing the MLOps platform as a simple set of ML software components is tempting. It shouldn't. There are many important aspects whose implementation directly affects the performance of all ML processes.
As a whole model, we can refer to the concept of an information system. If we study the architectural aspects of information systems, we can arrive at the following set of their constituent elements:
I want to add one more element - knowledge, but it is more about preserving the experience of the team in the process of building and operating the ML platform. It is great when employees have built a platform and successfully use it. It is great when this experience can be passed on to new employees, and they will be able to develop the platform further. And they will understand what decisions were made during its development and why.
To describe the complexity of the ML platform in more detail, it is necessary to consider each information system element.
Not all market players consciously approach the choice of hardware for the ML-system. This is especially noticeable in terms of GPU selection. Often, the hardware on which the pipeline is built is used. Then this pattern just flows into the following projects. However, this is sometimes good. For example, it is justified if it is costly to rewrite the pipeline.
If a small company can buy a server for ML, it will most likely choose a machine with a consumer GPU (RTX 2000/3000).
Now let's analyze this choice:
GPU memory is understood. But how do you select CPU and RAM?
You can look at examples of server configurations from providers, but this does not guarantee that they will be suitable for the company's specific tasks. Surveys of companies show that the size of RAM is adjusted to the size of the maximum dataset to increase the speed of its loading into the GPU. GPU is the most expensive resource in a server, so ideally, it should not be idle.
CPU is most often needed for preprocessing and loading data into RAM. Depending on the specifics of the experiment, either all CPU cores or a small part of them may be utilized. Some people allocate from 12 to 24 vCPUs for each high-performance GPU, while 8 is enough for some people. The exact number of cores can be obtained only by experiments.
An additional headache is the need to use the GPU in Kubernetes since it is the de facto standard for implementing ML systems. The headache is the need to turn the GPU into an available resource. With this, it will be possible to use the graphics card. To do this, you must deal with the Nvidia GPU Operator or its separate component - Device Plugin. And while we're at it, annotating nodes with labels correctly is a good idea. Not even all Kubernetes experts have heard about it.
The conversation about software needs to be divided into several parts:
ML-specific refers to software solutions that implement parts of the functionality of MLOps schema components. ML part can be described as following software components.
The experimentation process can be built with:
Automation of model retraining can be implemented using:
Serving and Deploying models can be implemented using:
If a project already requires elaborate work with features, the most popular solution is Feast. It has alternatives. However, more often, companies write something of their own, as the performance of ready-made solutions is not enough.
There is a whole stack of ML components that you need to know how to customize, maintain, and operate. The "customize and maintain" part requires the use of additional software. For example:
In addition, we should remember various security certificates, domain names, databases for used components, cascading caches for datasets, and much more.
Learning and configuring everything on a single system can take a lot of time and effort. Just putting ClearML on a server won't do the trick.
One more time about data. Here, I will refer to different implementations of existing data storage platforms, so that those interested can learn with MLOps the now fashionable terminological vocabulary of Data engineers.
So, data for ML can be stored in complex architectures:
Building them takes a considerable amount of resources.
Now, Kubernetes clusters are quite often located at a specific provider. Some teams have in-house infrastructure, but even there, it is in a separate server room, quite remote from the group.
The problem is that the data on which the model is trained must be uploaded to a specific server where the training process takes place. If the network channel is not wide enough, you must wait a long time for the necessary data to be loaded. This dramatically increases the waiting time for the results of experiments.
It is necessary to invent different network architectures to make the data transfer faster or to introduce caches into the cluster. If the data is in-house and the computational resources are with the provider, a separate network channel may be required to maintain performance with such a distributed architecture.
Here, you need up-to-date instructions and regulations by which the team will work and new participants will learn.
This is where ITIL comes to mind with its IT Service Management best practices, but there is yet to be an alternative in the world of MLOps. If CRISP-DM already looks outdated, CRISP-ML still needs to be developed. You can find its list of processes, but there needs to be a more detailed description of the methods themselves.
To start with, I can suggest thinking about the following regulations:
In any case, each team builds its customized processes, even if it initially uses an off-the-shelf approach.
If you don't have a dedicated role on your command that is responsible for making the ML technology stack work, don't despair. It's still common right now. MLOps engineers can still be considered unicorns for now, almost like Angular developers.
Today, it is more common to give the task of implementing ML systems to DevOps specialists. If they are responsible for CI/CD, let them take over ML pipelines as well. Naturally, this approach has some disadvantages:
At a minimum, DevOps specialists need a lot of time to learn all the processes and technologies. In one version of the ideal world, the MLOps specialist is the next stage of development of a DevOps engineer. It is when all Terraform files have already been written, configured with Ansible, and run in Kubernetes via Argo CD.
It is useful when an MLOps engineer understands the world of ML development, knows the complexities, and can reasonably adjust pipelines. Thus, in another version of the ideal world, an MLOps engineer is an ML developer who is not only willing to teach ML models but also to deal with infrastructure.
Currently, an ideal MLOps-engineer is presented as a "dragon warrior": student, teacher, data scientist, backend developer, ML engineer, data engineer, DevOps, software developer, and the rest in one person. But where to find such a person?
If you made it to the end, you are a strong-minded person who is really interested in ML and MLOps!