A Guide to Scaling Machine Learning Models in Production

by harkous, December 25th, 2017

The workflow for building machine learning models often ends at the evaluation stage: you have achieved an acceptable accuracy, and “ta-da! Mission Accomplished.”

Beyond that, it might just be sufficient to get those nice-looking graphs for your paper or for your internal documentation.

In fact, going the extra mile to put your model into production is not always needed. And even when it is, this task is often delegated to a system administrator.

However, nowadays, many researchers/engineers find themselves responsible for handling the complete flow from conceiving the models to serving them to the outside world. Whether it is a university project or a personal experiment, demonstrating our work is generally a great way to get a wider audience interested. Few people would make the extra effort to use a system for which the value is not instantly perceived.

In this article, we’ll go through this workflow together, a process that I have had to do repeatedly myself. The assumption is that you have already built a machine learning or deep learning model using your favorite framework (scikit-learn, Keras, TensorFlow, PyTorch, etc.). Now you want to serve it to the world at scale via an API.

By “at scale”, we’re not talking about the industrial scale of a huge company. The goal is to make the best of that server with lots of CPUs and plenty of RAM, sitting idle at your institution or in the cloud. This entails serving multiple requests at a time, spawning new processes as the load increases, and reducing the number of processes as the load decreases. You also want the additional guarantee that your server will be restarted after unexpected system failures.

If that is what you have in mind, let’s go through it together.

We’ll be considering the context of Python-based frameworks on Linux servers. Our setup will involve:

  • Anaconda: for managing package installation and creating an isolated Python 3 environment.
  • Keras: a high-level neural networks API that is capable of running on top of TensorFlow, CNTK, or Theano.
  • Flask: a minimalistic Python framework for building RESTful APIs. Despite being easy to use, Flask’s built-in server is intended for development (at the time of writing, it serves only one request at a time by default); hence it is not suitable on its own for deployment in production.
  • nginx: the highly stable web server, which provides benefits such as load-balancing, SSL configuration, etc.
  • uWSGI: a highly configurable WSGI server (Web Server Gateway Interface) that allows forking multiple workers to serve multiple requests at a time.
  • systemd: an init system used in multiple Linux distributions to manage system processes after booting.

Nginx will be our interface to the internet, and it will be the one handling clients’ requests. Nginx has native support for uWSGI’s binary uwsgi protocol, and in our setup the two communicate via a Unix socket. In turn, the uWSGI server will invoke a callable object within our Flask application directly. That is how requests will be served.

The complete workflow illustrated
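
To make the notion of a callable concrete: under the WSGI convention, the server calls an object with the request environment and a start_response function, and Flask's app object implements that same interface. The framework-free sketch below shows the bare interface using only the standard library. The names simple_app and call_once and the response body are illustrative and not part of this article's code.

```python
from wsgiref.util import setup_testing_defaults


def simple_app(environ, start_response):
    """A bare WSGI callable: the kind of object a WSGI server invokes per request."""
    body = b'{"status": "ok"}'
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]  # WSGI expects an iterable of byte strings


def call_once(app, path="/"):
    """Invoke a WSGI callable by hand, as a server would, and capture the result."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environment
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_once(simple_app))
```

A server such as uWSGI performs the equivalent of call_once for every incoming request, only with a real environment built from the HTTP request.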

A few notes at the beginning of this tutorial:

  • Most components above can be easily replaced by equivalent components with little to no change in the rest of the steps. For example, Keras can be easily replaced with PyTorch, Flask can be easily replaced with Bottle, and so on.
  • We will only consider the case of serving models over CPU. The typical case is having access to a server with a lot of CPU cores and trying to make use of these cores to serve the models. GPUs, on the other hand, are more expensive to get in large numbers. Moreover, depending on your application, the speed gain you get from using a GPU at prediction time might not be that significant (especially in NLP applications).

Setting up the Environment

To begin with, we need to install the systemd and nginx packages:

sudo apt-get install systemd nginx

Next, we have to install Anaconda by following the instructions on the official site, which consist of downloading the executable, running it, and adding Anaconda to your system’s PATH. Below, we will assume that Anaconda is installed under the home directory.

All the code and configuration files in this article are available from the accompanying GitHub repository. But make sure you follow the steps below to get the full workflow.

Repository: harkous/production_ml

Next, let’s create the isolated Anaconda environment from the environment.yml file in the repository, which already lists several of the frameworks we’ll be using.

We run the following to create the environment:

conda env create --file environment.yml

When we want to activate this environment, we run:

source activate production_ml_env

By now, we have Keras installed, alongside flask, uwsgi, uwsgitop, etc. So we are ready to get started.

Building the Flask Web App

For the purposes of this tutorial, we will not dive deeply into how to build your ML model. Instead, we will adapt the example of topic classification using the Reuters newswire dataset bundled with Keras. The code for building the classifier is included in the repository.

To replicate the setup we use here, simply run the following to train a model without a GPU:


This will create a model file reuters_model.hdf5 in the folder models. Now, we are ready to serve the model via Flask on port 4444. In the code below, we provide a single REST endpoint /predict that supports GET requests, where the text to classify is provided as a parameter. The returned JSON is of the form {"prediction": "N"}, where N is an integer representing the predicted class.
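
A minimal sketch of such an endpoint might look like the following. It is not the repository's code: predict_class is a hypothetical stand-in for loading reuters_model.hdf5 and running the Keras model, so that the snippet runs without a trained model, and the error response for a missing parameter is our own addition.

```python
from flask import Flask, jsonify, request

# `app` is the callable object that a WSGI server such as uWSGI would invoke.
app = Flask(__name__)


def predict_class(text):
    """Hypothetical placeholder for the real model call.

    In the actual application, this is where the Keras model loaded from
    models/reuters_model.hdf5 would be applied to the vectorized text.
    """
    return len(text.split()) % 46  # the Keras Reuters dataset has 46 topics


@app.route("/predict", methods=["GET"])
def predict():
    text = request.args.get("text", "")
    if not text:
        # Assumption: reject requests that do not carry the text parameter.
        return jsonify({"error": "missing 'text' parameter"}), 400
    return jsonify({"prediction": str(predict_class(text))})

# For local development on port 4444, one could call:
#     app.run(host="0.0.0.0", port=4444)
```

Keeping app at module level matters later on, because the WSGI server imports the module and looks the object up by name.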

To start the Flask application server, we run:


Voila! Now we have the simple, lightweight server running.

You can test the server with your favorite REST client (e.g., Postman) or by simply opening this URL in your web browser (replace **your_server_url** with your server’s address): http://**your_server_url**:4444/predict?text=this is a news sample text about sports and football in specific

You should get back a response such as:

{"prediction": "11"}
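
Since the sample text contains spaces, it is safer to percent-encode the query when calling the endpoint from a script rather than a browser. Here is a small sketch using only the standard library. The function names build_predict_url and fetch_prediction are ours, and your_server_url remains a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_predict_url(host, text, port=4444):
    """Return the /predict URL with the text safely percent-encoded."""
    return "http://{}:{}/predict?{}".format(host, port, urlencode({"text": text}))


def fetch_prediction(host, text, port=4444, timeout=10):
    """Call the endpoint and decode the JSON body (needs a running server)."""
    with urlopen(build_predict_url(host, text, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; no request is sent here.
    print(build_predict_url("your_server_url", "a news sample about football"))
```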

Configuring the uWSGI Server

Now, we are off to scaling our simple application server. uWSGI will be the key here. It communicates with our Flask application by invoking the callable object app in our application file. uWSGI includes most of the parallelization features we are after. Its configuration file, uwsgi.ini, is available in the repository.

On your side, you have to set the option **my_app_folder** to your own app directory and the option **my_user** to your own username. Depending on your needs and file locations, you might need to modify or add other options too.

One important section in uwsgi.ini is the part that uses the Cheaper subsystem in uWSGI, which lets us scale the number of parallel workers dynamically to serve concurrent requests. This is one of the cool features of uWSGI: scaling up and down is attainable with a few parameters. With the configuration used here, we will have at least 5 workers at all times. Upon load increase, Cheaper will allocate 3 additional workers at a time until all the requests find available workers. The maximum number of workers is set to 50.
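
The scaling rule described here can be modelled in a few lines. This is a deliberately simplified sketch of the behaviour stated in the text (a floor of 5, steps of 3, a cap of 50), not uWSGI's actual algorithm, which also accounts for timing and for scaling back down. The function name workers_needed is ours.

```python
def workers_needed(concurrent_requests, minimum=5, step=3, maximum=50):
    """Workers running once enough steps were added to cover the concurrent requests."""
    workers = minimum
    while workers < concurrent_requests and workers < maximum:
        workers = min(workers + step, maximum)
    return workers


if __name__ == "__main__":
    for load in (2, 6, 12, 200):
        print(load, "concurrent requests ->", workers_needed(load), "workers")
```

Under this model, 12 simultaneous requests lead to 14 workers (5, then 8, 11, 14), and any load beyond 50 requests saturates at the cap, with the remaining requests waiting for a free worker.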

In your case, the best configuration options depend on the number of cores in the server, the total memory available, and the memory consumption of your application. Take a look at the official docs for advanced deployment options.
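
As a rough starting point, the worker cap cannot usefully exceed what fits in memory, since each worker holds its own copy of the model when the application is loaded per worker. The sketch below is a back-of-the-envelope bound, not an official recommendation. The function name, the 1 GB reserve, and the figures in the example are made up, so measure your own model's footprint.

```python
import os


def max_workers_by_memory(total_mem_mb, per_worker_mb, reserved_mb=1024):
    """Largest worker count whose combined footprint fits in RAM, leaving a reserve."""
    usable = total_mem_mb - reserved_mb
    return max(usable // per_worker_mb, 0)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # Hypothetical numbers: a 64 GB server and about 800 MB per loaded model.
    by_memory = max_workers_by_memory(64 * 1024, 800)
    print("cores:", cores, "memory-bound cap:", by_memory)
```

Comparing this bound with the number of cores gives a first idea of whether memory or CPU will be the limiting factor for the maximum number of workers.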

Connecting uWSGI with nginx

We’re almost there. If we start uWSGI now (which we’ll do in a while), it will take care of invoking the app from our application file, and we will benefit from all the scaling features it provides. However, in order to get REST requests from the internet and pass them to the Flask app via uWSGI, we need to configure nginx.

The repository includes a bare-bones configuration file for nginx, with just the parts we depend on for this application. Of course, nginx can additionally be used for configuring SSL or for serving static files, but that is out of the scope of this article.

We place this file in /etc/nginx/sites-available/nginx_production_ml (you will need sudo access for that). Then, to enable this nginx configuration, we link it to the sites-enabled directory:

sudo ln -s /etc/nginx/sites-available/nginx_production_ml /etc/nginx/sites-enabled

We restart nginx:

sudo service nginx restart

Configuring the systemd Service

Finally, we will launch the uWSGI server we configured earlier. However, in order to ensure that our server does not die forever after system restarts or unexpected failures, we will launch it as a systemd service. Here is our service configuration file, which we place in the /etc/systemd/system directory using:

sudo vi /etc/systemd/system/production_ml.service

We start the service with:

sudo systemctl start production_ml.service

To allow this service to start when the machine is rebooted:

sudo systemctl enable production_ml.service

At this stage, our service should start successfully. If we update the service later, we simply have to restart it:

sudo systemctl restart production_ml.service

Monitoring the Service

To keep an eye on the service and see the load per worker, we can use uwsgitop. In uwsgi.ini, we have already configured a stats socket within our application folder. To see the stats, execute the following command in that folder:

uwsgitop stats.production_ml.sock

Here is an example of the workers in action, with additional workers that have already been spawned. To simulate such a heavy load on your side, even with simple tasks, you can artificially add a time.sleep(3) in the prediction code.

One way to send concurrent requests to your server is using curl (remember to replace YOUR_SERVER_NAME_OR_IP with your server’s URL or IP address).
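
If you prefer to stay in Python, concurrent requests can also be generated with a thread pool from the standard library. This is an alternative sketch, not the curl command from the original post. The function names and the default of 20 parallel requests are our choices.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_status(url, timeout=30):
    """Issue one GET request and return its HTTP status code.

    Note that urlopen raises an exception for 4xx/5xx responses.
    """
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


def send_concurrent(url, n_requests=20, n_threads=20):
    """Fire n_requests GETs at url from a pool of threads; return the status codes."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fetch_status, [url] * n_requests))
```

For example, calling send_concurrent("http://YOUR_SERVER_NAME_OR_IP:4444/predict?text=sports", n_requests=50, n_threads=50) while uwsgitop is open should show the requests being spread over the workers.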

In order to monitor the log of the application itself, we can make use of journalctl:

sudo journalctl -u production_ml.service -f

Your output should look like this:

Final Notes

If you have reached this stage, and your application has run successfully, then this article has served its purpose. A few additional notes are worth mentioning, though:

  • **lazy-apps** mode in uWSGI: To keep this article general enough, we have used the lazy-apps mode in uWSGI, which loads the application once per worker. According to the docs, this requires O(n) time to load (where n is the number of workers). It also likely requires more memory, but it results in a clean environment per worker. By default, uWSGI loads the application differently: it starts with one process that loads the application, then forks itself multiple times to create the additional workers. This saves memory. However, it does not play well with all ML frameworks. For example, the TensorFlow backend in Keras fails without the lazy-apps mode. The best approach may be to first try without lazy-apps = true, and to switch to it if you encounter similar issues.
  • Parameters of the Flask App: Because uWSGI invokes app as a callable, the parameters of the app itself should not be passed via the command line. You’re better off using a configuration file with the likes of configparser to read such parameters.
  • Scaling across multiple servers: The guide above does not discuss the case of multiple servers. Luckily enough, this can be achieved without a significant change in our setup. Benefiting from the load-balancing feature in nginx, you can set up multiple machines, each with the uWSGI setup we described above. Then, you can configure nginx to route the requests to the different servers. nginx comes with multiple methods to distribute the load, ranging from a simple round-robin to accounting for the number of connections or the average latency.
  • Port selection: The above guide uses port 4444 for illustration purposes. Feel free to adapt to your own ports. And make sure that you open these ports in the firewall or ask your institution’s administrators to do so.
  • Socket permissions: We have been permissive in the socket permissions by giving write access to all users. Feel free to also adjust these permissions to your purposes and to run the service with a different user/group. Make sure that your nginx and uWSGI can still talk to each other successfully after your changes.
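
Returning to the note on the parameters of the Flask app: a configuration file read with configparser could be handled roughly as follows. The file name app_config.ini, the section name model, and the keys model_path and max_words are hypothetical, so adapt them to your application.

```python
import configparser

# Hypothetical defaults, used when the file or a key is missing.
DEFAULTS = {"model_path": "models/reuters_model.hdf5", "max_words": "1000"}


def load_settings(path="app_config.ini"):
    """Read the [model] section of an INI file into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_dict({"model": DEFAULTS})
    parser.read(path)  # silently skips the file if it does not exist
    section = parser["model"]
    return {
        "model_path": section.get("model_path"),
        "max_words": section.getint("max_words"),
    }


if __name__ == "__main__":
    print(load_settings())
```

Because the settings are read when the module is imported or when the function is called, they work the same way whether the app is started directly or invoked by uWSGI.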

So that’s it! I hope this guide and the associated repository will be helpful for all those trying to deploy their models into production as part of a web application or as an API. If you have any feedback, feel free to drop a comment below.
