The workflow for building machine learning models often ends at the evaluation stage: you have achieved an acceptable accuracy, and “ta-da! Mission Accomplished.”
Beyond that, it might just be sufficient to get those nice-looking graphs for your paper or for your internal documentation.
In fact, going the extra mile to put your model into production is not always needed. And even when it is, the task is often delegated to a system administrator.
However, nowadays, many researchers/engineers find themselves responsible for handling the complete flow from conceiving the models to serving them to the outside world. Whether it is a university project or a personal experiment, demonstrating our work is generally a great way to get a wider audience interested. Few people would make the extra effort to use a system for which the value is not instantly perceived.
In this article, we’ll go through this workflow together; a process I have had to repeat many times myself. The assumption is that you have already built a machine learning or deep learning model using your favorite framework (scikit-learn, Keras, TensorFlow, PyTorch, etc.), and that you now want to serve it to the world at scale via an API.
By “at scale”, we’re not talking about the industrial scale of a huge company. The goal is to make the best out of that server with lots of CPUs and plenty of RAM sitting idly at your institution or in the cloud. This entails serving multiple requests at a time, spawning new processes as the load increases, and reducing the number of processes as it decreases. You also want the additional guarantee that your server will be restarted after unexpected system failures.
If that is what you have in mind, let’s go through it together.
We’ll be considering the context of Python-based frameworks on Linux servers. Our setup will involve:
- nginx as the web server facing the internet
- uWSGI as the application server that scales our app across workers
- Flask as the Python web framework wrapping our model
- Anaconda for managing an isolated Python environment
- systemd for keeping the service running across failures and reboots
Nginx will be our interface to the internet, and it will be the one handling clients’ requests. Nginx has native support for the binary uWSGI protocol, and they communicate via Unix sockets. In turn, the uWSGI server will be invoking a callable object within our Flask application directly. That is the way that requests will be served.
The complete workflow illustrated
A few notes at the beginning of this tutorial:
To begin with, we need to install the `systemd` and `nginx` packages:
sudo apt-get install systemd nginx
Next, we have to install Anaconda by following the instructions on the official site, which consist of downloading the executable, running it, and adding Anaconda to your system’s PATH. Below, we will assume that Anaconda is installed under the home directory.
All the code and configuration files in this article are available from the accompanying GitHub repository. But make sure you follow the steps below to get the full workflow.
harkous/production_ml — Scaling Machine Learning Models in Production (github.com)
Next, let’s create the isolated Anaconda environment from the `environment.yml` file. Here is what this file looks like (it already contains several of the frameworks we’ll be using):
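(The repository has the authoritative version; the sketch below assumes Python 3.6 and the packages used in this article, so the exact entries and versions may differ.)

```yaml
# Sketch of an environment.yml for this setup; the repository's file is the
# reference, and the exact packages/versions there may differ.
name: production_ml_env
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy
  - h5py
  - flask
  - keras
  - theano
  - pip
  - pip:
      - uwsgi
      - uwsgitop
```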
We run the following to create the environment:
conda env create --file environment.yml
When we want to activate this environment, we run:
source activate production_ml_env
By now, we have Keras installed, alongside flask, uwsgi, uwsgitop, etc. So we are ready to get started.
For the purposes of this tutorial, we will not dive deeply into how to build your ML model. Instead, we will adapt the example of topic classification using the Reuters newswire dataset bundled within Keras. This is the code for building the classifier:
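(The complete script is in the repository; the condensed sketch below is adapted from the standard Keras Reuters MLP example, with illustrative hyperparameters, and may differ from the repository’s build_classifier.py in its details.)

```python
# Condensed sketch of build_classifier.py, adapted from the Keras Reuters MLP
# example; the script in the repository may differ in details.
import os
import numpy as np
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

max_words = 1000  # vocabulary size (illustrative)

# Load the Reuters newswire dataset bundled with Keras
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                          test_split=0.2)
num_classes = np.max(y_train) + 1

# Vectorize the word-index sequences into binary bag-of-words matrices
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

# A simple MLP topic classifier
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test, verbose=0))

# Save the trained model where the Flask app will look for it
os.makedirs('models', exist_ok=True)
model.save('models/reuters_model.hdf5')
```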
To replicate the setup we use here, simply run the following to train a model without a GPU:
export CUDA_VISIBLE_DEVICES=-1
KERAS_BACKEND=theano python build_classifier.py
This will create a model file `reuters_model.hdf5` in the folder `models`. Now, we are ready to serve the model via Flask on port `4444`. In the code below, we provide a single REST endpoint `/predict` that supports `GET` requests, where the text to classify is provided as a parameter. The returned JSON is of the form `{"prediction": "N"}`, where `N` is an integer representing the predicted class.
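A minimal sketch of such an `app.py` is shown below; the text preprocessing mirrors the bag-of-words setup from the training sketch above and is an assumption about the repository’s code, which remains the reference.

```python
# Minimal sketch of app.py; the version in the repository is the reference and
# may preprocess the text differently.
import numpy as np
from flask import Flask, request, jsonify
from keras.datasets import reuters
from keras.models import load_model
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

max_words = 1000  # must match the value used at training time
model = load_model('models/reuters_model.hdf5')
word_index = reuters.get_word_index()
tokenizer = Tokenizer(num_words=max_words)

app = Flask(__name__)

@app.route('/predict', methods=['GET'])
def predict():
    text = request.args.get('text', '')
    # Map words to dataset indices (load_data offsets indices by 3 by default),
    # then to a binary bag-of-words vector, as during training
    sequence = [word_index[w] + 3 for w in text_to_word_sequence(text)
                if w in word_index]
    features = tokenizer.sequences_to_matrix([sequence], mode='binary')
    predicted_class = int(np.argmax(model.predict(features)[0]))
    return jsonify({'prediction': str(predicted_class)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=4444)
```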
To start the Flask application server, we run:
python app.py
Voila! Now we have the simple, lightweight server running.
You can test the server with your favorite REST client (e.g., Postman) or by simply going to this URL in your web browser (replace **your_server_url** with your server’s URL): http://**your_server_url**:4444/predict?text=this is a news sample text about sports and football in specific
You should get back a response like:
{"prediction": "11"}
Now, we are off to scaling our simple application server. uWSGI will be the key here. It communicates with our Flask application by invoking the callable object `app` in the file `app.py`. uWSGI includes most of the parallelization features we are after. Its configuration file looks as follows:
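(The full `uwsgi.ini` is in the accompanying repository; the sketch below captures just the options discussed in this article, with socket names and paths as illustrative placeholders.)

```ini
# Sketch of uwsgi.ini reflecting the options discussed in this article; socket
# names and paths are placeholders, and the repository's file is the reference.
[uwsgi]
# Replace my_user/my_app_folder with your own username and app directory
chdir = /home/my_user/my_app_folder
module = app:app

# Run inside the conda environment created earlier
home = /home/my_user/anaconda3/envs/production_ml_env

# Unix socket that nginx will talk to over the uwsgi protocol
socket = /home/my_user/my_app_folder/production_ml.sock
chmod-socket = 664
vacuum = true
die-on-term = true

# Load a full copy of the app in each worker (see the notes at the end)
lazy-apps = true

# Cheaper subsystem: keep at least 5 workers, add 3 at a time, cap at 50
master = true
processes = 50
cheaper = 5
cheaper-initial = 5
cheaper-step = 3

# Stats socket consumed by uwsgitop
stats = /home/my_user/my_app_folder/stats.production_ml.sock
```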
On your side, you have to modify the option **my_app_folder** to point to your own app directory and the option **my_user** to be your own username. Depending on your needs and file locations, you might need to modify or add other options too.
One important section in `uwsgi.ini` is the part where we use the Cheaper subsystem in uWSGI, which allows us to run multiple workers in parallel to serve multiple concurrent requests. This is one of the cool features of uWSGI: dynamic scaling up and down is attainable with a few parameters. With the above configuration, we will have at least 5 workers at all times. As the load increases, Cheaper will allocate 3 additional workers at a time until all the requests find available workers. The maximum number of workers above is set to 50.
In your case, the best configuration options depend on the number of cores in the server, the total memory available, and the memory consumption of your application. Take a look at the official docs for advanced deployment options.
We’re almost there. If we start uWSGI now (which we’ll do in a while), it will take care of invoking the app from the file `app.py`, and we will benefit from all the scaling features it provides. However, in order to get REST requests from the internet and pass them to the Flask app via uWSGI, we will be configuring nginx.
Here is a barebones configuration file for nginx, with just the part we depend on for this application. Of course, nginx can additionally be used to configure SSL or to serve static files, but that is outside the scope of this article.
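A sketch of the relevant part could look like this; the server name, listening port, and socket path are placeholders that must match your `uwsgi.ini`, and the repository’s file is the reference.

```nginx
# Sketch of /etc/nginx/sites-available/nginx_production_ml; server_name, port,
# and the socket path are placeholders that must match your uwsgi.ini.
server {
    listen 80;
    server_name YOUR_SERVER_NAME_OR_IP;

    location / {
        include uwsgi_params;
        # Forward requests over the binary uwsgi protocol via the Unix socket
        uwsgi_pass unix:/home/my_user/my_app_folder/production_ml.sock;
    }
}
```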
We place this file in `/etc/nginx/sites-available/nginx_production_ml` (you will need sudo access for that). Then, to enable this nginx configuration, we link it to the `sites-enabled` directory:
sudo ln -s /etc/nginx/sites-available/nginx_production_ml /etc/nginx/sites-enabled
We restart nginx:
sudo service nginx restart
Finally, we will launch the uWSGI server we configured earlier. However, in order to ensure that our server does not die forever after system restarts or unexpected failures, we will launch it as a `systemd` service. Here is our service configuration file, which we place in the `/etc/systemd/system` directory using:
sudo vi /etc/systemd/system/production_ml.service
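A sketch of what `production_ml.service` could contain, using the placeholder user, app folder, and conda paths from above (the repository’s file is the reference):

```ini
# Sketch of production_ml.service; the user, paths, and environment location
# are placeholders matching the assumptions above.
[Unit]
Description=uWSGI server for the production_ml Flask application
After=network.target

[Service]
User=my_user
WorkingDirectory=/home/my_user/my_app_folder
# Use the uwsgi binary installed in the conda environment
ExecStart=/home/my_user/anaconda3/envs/production_ml_env/bin/uwsgi --ini uwsgi.ini
Restart=always
KillSignal=SIGQUIT

[Install]
WantedBy=multi-user.target
```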
We start the service with:
sudo systemctl start production_ml.service
To allow this service to start when the machine is rebooted:
sudo systemctl enable production_ml.service
At this stage, our service should start successfully. If we update the service later, we simply have to restart it:
sudo systemctl restart production_ml.service
To keep an eye on the service and see the load per worker, we can use [uwsgitop](https://github.com/xrmx/uwsgitop). In `uwsgi.ini`, we have already configured a stats socket within our application folder. To see the stats, execute the following command in that folder:
uwsgitop stats.production_ml.sock
Here is an example of the workers in action, with additional workers that have already been spawned. To simulate such a heavy load on your side, even with simple tasks, you can artificially add a `time.sleep(3)` in the prediction code.
One way to send concurrent requests to your server is using curl (remember to replace `YOUR_SERVER_NAME_OR_IP` with your server’s URL or IP address):
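For instance, a simple loop like the following fires a batch of parallel requests; the request count and the sample text are arbitrary choices for this sketch.

```bash
# Send 20 parallel requests to exercise the Cheaper scaling; the sample text
# and the request count are arbitrary.
for i in $(seq 1 20); do
  curl -s "http://YOUR_SERVER_NAME_OR_IP/predict?text=this%20is%20a%20news%20sample%20text%20about%20sports" &
done
wait
```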
In order to monitor the log of the application itself, we can make use of `journalctl`:
sudo journalctl -u production_ml.service -f
Your output should look like this:
If you have reached this stage and your application has run successfully, then this article has served its purpose. A few additional notes are worth mentioning, though:
- **lazy-apps mode in uWSGI**: To keep this article general enough, we have used the `lazy-apps` mode in uWSGI, which loads the application once per worker. According to the docs, this requires O(n) time to load (where n is the number of workers), and it likely requires more memory, but it results in a clean environment per worker. By default, uWSGI loads the whole application differently: it starts with one process and then forks itself multiple times for the additional workers, which saves memory. However, that default does not play well with all ML frameworks. For example, the TensorFlow backend in Keras fails without the `lazy-apps` mode (e.g., check this, this, and this). The best approach could be to try first without `lazy-apps = true` and to switch to it if you encounter similar issues.
- **Parameters of the app**: Since uWSGI invokes `app` as a callable, the parameters of the app itself should not be passed via the command line. You're better off using a configuration file, with the likes of configparser, to read such parameters.

So that's it! I hope this guide and the associated repository will be helpful for all those trying to deploy their models into production as part of a web application or as an API. If you have any feedback, feel free to drop a comment below.
And thanks for reading! You might also be interested in checking my other articles on my Medium page:
Hamza Harkous — Medium (medium.com)
… or in seeing what I tweet about:
Hamza Harkous (@hamzaharkous) | Twitter (twitter.com)