
An Architect's Guide to Machine Learning Operations and Required Data Infrastructure

by MinIO, September 5th, 2024


MLOps, short for Machine Learning Operations, is a set of practices and tools aimed at addressing the specific needs of engineers building models and moving them into production. Some organizations start off with a few homegrown tools that version datasets after each experiment and checkpoint models after every epoch of training. On the other hand, many organizations have chosen to adopt a formal tool that has experiment tracking, collaboration features, model serving capabilities, and even pipeline features for processing data and training models.


To make the best choice for your organization, you should understand all the capabilities available from the leading MLOps tools in the industry. If you go the homegrown route, you should understand the capabilities you are giving up. A homegrown approach is fine for small teams that need to move quickly and may not have time to evaluate a new tool. If you choose to implement a third-party tool, then you will need to pick the tool that best matches your organization's engineering workflow. This could be tricky because the top tools today vary significantly in their approach and capabilities. Regardless of your choice, you will need data infrastructure that can handle large volumes of data and serve training sets in a performant manner. Checkpointing models and versioning large datasets require scalable capacity, and if you are using expensive GPUs, you will need performant infrastructure to get the most out of your investment.


In this post, I will present a feature list that architects should consider regardless of the approach or tooling they choose. This feature list comes from my research and experiments with three of the top MLOps tools today - Kubeflow, MLflow, and MLRun. For organizations that choose to start off with a homegrown solution, I will present a data infrastructure that can scale and perform. (Spoiler alert - all you need here is MinIO.) When it comes to third-party tools, I have noticed a pattern with the vendors I have researched. For organizations that choose to adopt MLOps tooling, I will present this pattern and tie it back to our Modern Datalake Reference Architecture.


Before diving into features and infrastructure requirements, let’s better understand the importance of MLOps. To do this, it is helpful to compare model creation to conventional application development.

The Difference between Models and Applications

Conventional application development, like implementing a new microservice that adds a new feature to an application, starts with reviewing a specification. New data structures or changes to existing data structures are designed first. The design of the data should not change once coding begins. The service is then implemented, and coding is the main activity in this process. Unit tests and end-to-end tests are also coded. These tests prove that the code is not faulty and correctly implements the specification. They can be run automatically by a CI/CD pipeline before deploying the entire application.


Creating a model and training it is different. The first step is understanding the raw data and the needed prediction. ML engineers do have to write some code to implement their neural networks or set up an algorithm, but coding is not the dominant activity. The main activity is repeated experimentation. During experimentation, the design of the data, the design of the model, and the parameters used will all change. After every experiment, metrics are created that show how the model performed as it was trained. Metrics are also generated to determine model performance against a validation set and a test set. These metrics are used to prove the quality of the model. You should save the model after every experiment, and every time you change your datasets, you should save them as well. Once a model is ready to be incorporated into an application, it must be packaged and deployed.
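To make the bookkeeping described above concrete, here is a minimal sketch of what a homegrown experiment record could look like. It is not taken from any MLOps tool; the function names (`record_experiment`, `best_run`), the JSON layout, and the idea of identifying a dataset version by its SHA-256 hash are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_experiment(run_dir: Path, run_id: str, dataset: Path,
                      hyperparameters: dict, metrics: dict) -> Path:
    """Write one JSON record per experiment (hypothetical layout)."""
    record = {
        "run_id": run_id,
        "dataset_file": dataset.name,
        # The hash identifies the exact dataset version used in this run.
        "dataset_sha256": file_sha256(dataset),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / f"{run_id}.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out


def best_run(run_dir: Path, metric: str) -> dict:
    """Return the recorded run with the highest value for `metric`."""
    records = [json.loads(p.read_text()) for p in run_dir.glob("*.json")]
    return max(records, key=lambda r: r["metrics"][metric])
```

Calling `record_experiment` after every run and `best_run` when results degrade gives you back the hyperparameters and dataset hash of the run you want to return to. A formal MLOps tool does the same job with a database and a UI on top.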


To summarize, MLOps is to machine learning what DevOps is to traditional software development. Both are a set of practices and principles aimed at improving collaboration between engineering teams (the Dev or ML) and IT operations (Ops) teams. The goal is to streamline the development lifecycle, from planning and development to deployment and operations, using automation. One of the primary benefits of these approaches is continuous improvement.


Let’s go a little deeper into MLOps and look at specific features to consider.

10 MLOps Features to Consider

Experiment tracking and collaboration are the features most associated with MLOps, but today's more modern MLOps tools can do much more. For example, some can provide a runtime environment for your experiments. Others can package and deploy models once they are ready to be integrated into an application.


Below is a superset of features found in MLOps tools today. This list also includes other things to consider, such as support and data integration.


  1. Support from a major player - MLOps techniques and features are constantly evolving. You want a tool that is backed by a major player (Google, Databricks, and McKinsey & Company back Kubeflow, MLflow, and MLRun, respectively), ensuring constant development and improvement. As a concrete example, many popular tools today were created before large language models (LLMs); consequently, many are adding new features to support generative AI.


  2. Modern Datalake Integration - Experiments generate a lot of structured and unstructured data. An MLOps tool that is perfectly integrated with the Modern Datalake (or Data Lakehouse) would store unstructured data in the Data Lake (this is MinIO directly), and structured data would go into the Data Warehouse. Unfortunately, many MLOps tools were around before the Open Table Formats that gave rise to the Modern Datalake, so most will have a separate solution for their structured data. This is typically an open-source relational database that your data infrastructure will need to support. With respect to unstructured data (datasets and model checkpoints), all the major tools in the industry use MinIO, as we have been around since 2014.


  3. Experiment Tracking - Probably the most important feature of an MLOps tool is keeping track of datasets, models, hyperparameters, and metrics for each experiment. Experiment tracking should also facilitate repeatability - if you got a desirable result five experiments ago and the experiments afterward degraded the performance of your model, then you should be able to use your MLOps tool to go back and get the exact hyperparameters and dataset features that produced the desirable result.


  4. Facilitate Collaboration - An important component of an MLOps tool is the portal or UI used to present the results of each experiment. This portal should be accessible to all team members so that they can see each other's experiments and make recommendations. Some MLOps tools have fancy graphical features that allow for custom graphs to be created comparing results from experiments.


  5. Model Packaging - This capability packages a model such that it is accessible from other programming environments - typically as a microservice. This is a nice feature to have. A trained model is nothing more than a serialized object. Many organizations may have this figured out already.


  6. Model Serving - Once a model is packaged as a service, this feature will allow for the automated deployment of the service containing the model to the organization’s formal environments. You will not need this feature if you have a mature CI/CD pipeline capable of managing all software assets across environments.


  7. Model Registry - A model registry provides a view of all the models currently under management by your MLOps tool. After all, the creation of production-grade models is the goal of all MLOps. This view should show models that got deployed to production as well as models that never made it into production. Models that made it into production should be tagged in such a way that you can also determine the version of the application or service that they were deployed into.


  8. Serverless Functions - Some tools provide features that allow code to be annotated so that a function or module can be deployed as a containerized service for running experiments in a cluster. If you decide to use this feature, then make sure all your engineers are comfortable with this technique. There can be a bit of a learning curve - engineers with a DevOps background will have an easier time, while engineers who previously studied machine learning with little coding experience will struggle.


  9. Data Pipeline Capabilities - Some MLOps tools aim to provide complete end-to-end capabilities and have features specific to building data pipelines for retrieving raw data, processing it, and storing clean data. Pipelines are usually specified as Directed Acyclic Graphs (DAGs) - some tools also have scheduling capabilities. When used in conjunction with serverless functions, this can be a powerful low-code solution for developing and running data pipelines. You will not need this if you are already using a pipeline or workflow tool.


  10. Training Pipeline Capabilities - This is similar to data pipelines, but a training pipeline picks up where data pipelines leave off. A training pipeline allows you to call your data access code, send data to your training logic, and annotate data artifacts and models so that they are automatically saved. Similar to data pipelines, this feature can be used in conjunction with serverless functions to create DAGs and schedule experiments. If you are already using a distributed training tool, then you may not need this feature. It is possible to start distributed training from a training pipeline, but this could be too complex.
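The pipeline features in items 9 and 10 rest on the idea of a DAG: each step runs only after the steps it depends on. The sketch below shows that idea with Python's standard-library `graphlib` module. It is not how Kubeflow, MLflow, or MLRun define pipelines; the step names, the shared `ctx` dictionary, and the "model" (just a mean) are placeholders chosen for illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable


def extract(ctx: dict) -> None:
    ctx["raw"] = [" 3", "4 ", "x", "5"]  # stand-in for retrieving raw data


def clean(ctx: dict) -> None:
    # Keep only values that parse as integers.
    ctx["clean"] = [int(v) for v in ctx["raw"] if v.strip().isdigit()]


def train(ctx: dict) -> None:
    # Stand-in "model": the mean of the clean data.
    ctx["model"] = sum(ctx["clean"]) / len(ctx["clean"])


STEPS: dict[str, Callable[[dict], None]] = {
    "extract": extract, "clean": clean, "train": train,
}

# Each step maps to the set of steps that must finish before it starts.
DEPENDS_ON: dict[str, set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
}


def run_pipeline(steps: dict[str, Callable[[dict], None]],
                 depends_on: dict[str, set[str]]) -> tuple[list[str], dict]:
    """Run every step after its dependencies; return the order and final context."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic,
    # and the list is built before any step executes.
    order = list(TopologicalSorter(depends_on).static_order())
    ctx: dict = {}
    for name in order:
        steps[name](ctx)
    return order, ctx


if __name__ == "__main__":
    print(run_pipeline(STEPS, DEPENDS_ON))
```

The data pipeline here ends at `clean` and the training pipeline picks up at `train`, which mirrors the split between items 9 and 10. A real tool adds scheduling, containerized execution, and automatic saving of artifacts on top of this ordering logic.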

MLOps and Storage

After looking at the differences between traditional application development and machine learning, it should be clear that to be successful with machine learning, you need some form of MLOps and a data infrastructure capable of performance and scalable capacity.


Homegrown solutions are fine if you need to start a project quickly and do not have time to evaluate a formal MLOps tool. If you take this approach, the good news is that all you need for your data infrastructure is MinIO. MinIO is S3 compatible, so if you started with another tool and used an S3 interface to access your datasets, then your code will just work. If you are starting out, then you can use our Python SDK, which is also S3 compatible. Consider using the enterprise version of MinIO, which has caching capabilities that can greatly speed up data access for training sets. Check out The Real Reasons Why AI is Built on Object Storage, where we dive into how and why MinIO is used to support MLOps. Organizations that choose a homegrown solution should still familiarize themselves with the ten features described above. You may eventually outgrow your homegrown solution, and the most efficient way forward is to adopt an MLOps tool.
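As a rough illustration of the homegrown route, the sketch below saves a model checkpoint to a bucket through the MinIO Python SDK. The object naming scheme, the bucket name `ml-artifacts`, the endpoint, and the credentials are all hypothetical; only `bucket_exists`, `make_bucket`, and `fput_object` are SDK calls, and the client is passed in so the helper can be exercised without a running server.

```python
from pathlib import Path


def checkpoint_key(experiment: str, epoch: int) -> str:
    """Build an object name (hypothetical naming scheme) for one checkpoint."""
    return f"experiments/{experiment}/checkpoints/epoch-{epoch:04d}.pt"


def upload_checkpoint(client, bucket: str, experiment: str,
                      epoch: int, path: Path) -> str:
    """Upload one checkpoint file and return the object name it was stored under.

    `client` is expected to be a `minio.Minio` instance; only `bucket_exists`,
    `make_bucket`, and `fput_object` are called on it.
    """
    if not client.bucket_exists(bucket_name=bucket):
        client.make_bucket(bucket_name=bucket)
    key = checkpoint_key(experiment, epoch)
    client.fput_object(bucket_name=bucket, object_name=key, file_path=str(path))
    return key


# Example wiring (assumes `pip install minio` and a reachable server;
# endpoint and credentials below are placeholders):
#
#   from minio import Minio
#   client = Minio(endpoint="localhost:9000", access_key="YOUR-ACCESS-KEY",
#                  secret_key="YOUR-SECRET-KEY", secure=False)
#   upload_checkpoint(client, "ml-artifacts", "baseline", 3, Path("model.pt"))
```

Calling `upload_checkpoint` at the end of each epoch gives you the per-epoch checkpointing described earlier, with zero-padded epoch numbers so that objects list in training order.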


Adopting a third-party MLOps tool is the best way to go for large organizations with several AI/ML teams creating models of different types. The MLOps tool with the most features is not necessarily the best tool. Look at the features above and make note of the features that you need, the features you currently have as part of your existing CI/CD pipeline, and finally, the features you do not want. This will help you find the best fit. MLOps tools have a voracious appetite for object storage, often petabytes of it. Many of them automatically version your datasets with each experiment and automatically checkpoint your models after each epoch. Here again, MinIO can help since capacity is not a problem. Similar to the homegrown solution, consider using the enterprise edition of MinIO. The caching features work automatically once configured for a bucket, so even though the MLOps tool does not request the use of the cache, MinIO will automatically cache frequently accessed objects like a training set.
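To see why capacity adds up so quickly, a back-of-the-envelope estimate helps. The function below assumes the worst case, where every experiment stores a full copy of the dataset and every epoch stores a full checkpoint. Real tools may deduplicate or store less, and the numbers in the example are invented for illustration.

```python
def estimate_storage_gb(experiments: int, epochs_per_experiment: int,
                        dataset_gb: float, checkpoint_gb: float) -> float:
    """Worst-case storage: one dataset copy per experiment, one checkpoint per epoch."""
    dataset_total = experiments * dataset_gb
    checkpoint_total = experiments * epochs_per_experiment * checkpoint_gb
    return dataset_total + checkpoint_total


if __name__ == "__main__":
    # Hypothetical team: 200 experiments, 50 epochs each,
    # a 100 GB dataset and 2 GB checkpoints.
    total = estimate_storage_gb(200, 50, 100.0, 2.0)
    print(f"{total:,.0f} GB")  # 20,000 GB of datasets + 20,000 GB of checkpoints
```

Under these assumptions a single team reaches 40,000 GB (40 TB), and several teams working with larger datasets can push the total toward petabytes.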

A Wishlist for the Future

Many of the MLOps tools on the market today use an open-source relational database to store the structured data generated during model training, which is usually metrics and hyperparameters. Unfortunately, this will be a new database that needs to be supported by your organization. Additionally, if an organization is moving toward a Modern Datalake (or Data Lakehouse), then an additional relational database is not needed. It would be nice if the major MLOps vendors considered using a data warehouse based on Open Table Formats (OTFs) to store their structured data.


All the major MLOps vendors use MinIO under the hood to store unstructured data. Unfortunately, this is generally deployed as a separate small instance that is installed as a part of the overall larger installation of the MLOps tool. Additionally, it is usually an older version of MinIO, which goes against our ethos of always running the latest and greatest. For existing MinIO customers, it would be nice to allow the MLOps tool to use a bucket within an existing installation. For customers new to MinIO, the MLOps tool should support the latest version of MinIO. Once installed, MinIO can also be used for purposes within your organization beyond MLOps resources, namely anywhere the strengths of object storage are required.

Conclusion

In this post, I presented an architect's guide to MLOps by investigating both MLOps features and the data infrastructure needed to support these features. At a high level, organizations can build a homegrown solution, or they can deploy a third-party solution. Regardless of the direction chosen, it is important to understand all the features available in the industry today. Homegrown solutions allow you to start a project quickly, but you may soon outgrow your solution. It is also important to understand your specific needs and how MLOps will work with an existing CI/CD pipeline. Many MLOps tools are feature-rich and contain features that you may never use or that you already have as part of your CI/CD pipeline.


To successfully implement MLOps, you need a data infrastructure that can support it. In this post, I presented a simple solution for those who choose a homegrown solution and described what to expect from third-party tools and the resources they require.


I concluded with a wish list for further development of MLOps tools that would help them to better integrate with the Modern Datalake.


For more information on using the Modern Datalake to support AI/ML workloads, check out AI/ML Within A Modern Datalake.


If you have any questions, be sure to reach out to us on Slack!