One of the known truths of the machine learning (ML) world is that it takes a lot longer to deploy ML models to production than to develop them.¹
The problem of deploying ML models to production is well known. Modern software requires a variety of crucial properties such as on-demand scaling and high availability. As a result, it can take a lot of time and effort to deploy models to production correctly.
Let’s discuss the options you have when it comes to deploying ML models. They are listed in order from the most general to the most ML-specific.
The most direct way to deploy anything is to rent a VM, wrap the model in some kind of server, and leave it running. While extremely straightforward and customizable, this method has numerous drawbacks, such as difficult integration into CI/CD pipelines and isolation problems.
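To make "wrap the model in some kind of server" concrete, here is a minimal sketch using only the Python standard library. The `predict` function is a hypothetical stand-in for a real model, and the JSON shape (`features` in, `prediction` out) and port number are assumptions chosen for illustration, not a recommended production setup.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a real model: a fixed linear scorer."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"features": [1.0, 2.0, 3.0]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))

        # Run the model and send the result back as JSON.
        body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering prediction requests on the given port."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` on the VM starts the server. Everything else (restarts, TLS, scaling, isolation from other workloads) is left to you, which is where the drawbacks mentioned above come from.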
It is also possible to deploy ML models in Docker containers using Kubernetes or a similar orchestration tool. This option brings many quality-of-life improvements. Models can easily be wrapped in purpose-built inference servers such as NVIDIA Triton or TensorFlow Serving (this works for the VM option as well). It also becomes easier to chain models together using sophisticated frameworks such as Kubeflow.
However, this customizability comes at the cost of DevOps complexity and the need to maintain the technologies that keep your model running.
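With a dedicated inference server, the client usually talks to the model over a standard HTTP protocol instead of custom server code. The sketch below builds a request in the style of the v2 inference protocol that NVIDIA Triton exposes over HTTP. The helper name `build_infer_request`, the model name, the input tensor name, and the base URL are all assumptions for illustration; real values depend on how your model is configured.

```python
import json


def build_infer_request(base_url, model_name, input_name, rows):
    """Return (url, body) for a v2-style inference request.

    `rows` is a rectangular list of lists of numbers; it is sent as a
    single FP32 tensor with shape [number of rows, row length].
    """
    flat = [float(value) for row in rows for value in row]
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), len(rows[0])],
                "datatype": "FP32",
                "data": flat,
            }
        ]
    }
    url = f"{base_url}/v2/models/{model_name}/infer"
    return url, json.dumps(body).encode("utf-8")


# Hypothetical names: "my_model" and "INPUT0" must match the deployed model.
url, body = build_infer_request(
    "http://localhost:8000", "my_model", "INPUT0", [[1, 2], [3, 4]]
)
```

The resulting `url` and `body` can be sent as a POST request with any HTTP client. Keeping request construction separate from the network call makes it easy to test without a running server.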
An easy way to just drop your model into the cloud is to use a PaaS or serverless platform. Here you have to wrap your model in some preprocessing and postprocessing code.
Platforms like Heroku or Google App Engine provide more flexibility, since you can even package your code in a container, while AWS Lambda, Google Cloud Functions, or Azure Functions make deployment much easier and integrate well with their respective cloud services.
This approach is best suited to background task processing, because inference latency is relatively high: you are typically limited to running models on CPU, and the models themselves are commonly stored far from the processing nodes and may take time to load.
We are now seeing the rise of ML-focused serverless providers that host your model and expose it through an API or a set of frameworks. Examples include Amazon SageMaker and Google Cloud AI Platform. These services still require you to rent the underlying compute instances on which your models run.
In general, using ML-focused serverless providers lets you separate GPU-intensive computations from CPU-intensive ones while providing on-demand scalability for the former. However, you still have to perform pre- and post-processing in the client application or in cloud functions.
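The split between client-side processing and hosted inference can be sketched as follows. The CPU-side steps (`preprocess`, `postprocess`) run in your application, while `call_endpoint` stands for the request to the hosted model. All names here are hypothetical, and the endpoint is passed in as a plain function so that the sketch does not depend on any particular provider's SDK.

```python
import math


def preprocess(text, vocab):
    """CPU-side step: map tokens to integer ids; unknown tokens get id 0."""
    return [vocab.get(token, 0) for token in text.lower().split()]


def postprocess(logits, labels):
    """CPU-side step: softmax over the logits, then pick the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def classify(text, vocab, labels, call_endpoint):
    """Run preprocessing locally, inference remotely, postprocessing locally."""
    token_ids = preprocess(text, vocab)
    logits = call_endpoint(token_ids)  # the GPU-intensive part lives behind this call
    return postprocess(logits, labels)


def fake_endpoint(token_ids):
    """Stand-in for the hosted model so the sketch runs without a network."""
    return [0.0, 2.0]


label, confidence = classify(
    "Great movie", {"great": 1, "movie": 2}, ["negative", "positive"], fake_endpoint
)
```

In a real setup `call_endpoint` would wrap the provider's client or an HTTP request, and the same `classify` function could live either in the client application or in a cloud function.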
We have now reviewed the common options for deploying ML models.
Thank you for reading! Stay tuned for more articles, and feel free to leave a comment or send us your questions by email.
One of the most underrated challenges in machine learning development is deploying trained models to production, and doing so in a scalable way. One joke I have read about it: “The most common way machine learning gets deployed today is PowerPoint slides :)”.