Should I use Amazon SageMaker for Deep Learning?

Written by coderik | Published 2018/01/25
Tech Story Tags: amazon | aws | sagemaker | deep-learning | tensorflow


Amazon Web Services offers a multitude of products related to Machine Learning in one way or another. There are almost plug-and-play API services like “Rekognition” for image and video analysis, “Lex” for conversational interfaces (chat bots), “Comprehend” for text analysis, “Transcribe” for speech-to-text and “Polly” for text-to-speech conversion, and so on. These specialized services provide pre-trained models that can be easily integrated into your applications via an API. Quite recently Amazon launched a lower-level, general-purpose service called “SageMaker”. This is more of a platform, tailor-made for common Machine Learning workflows. Finally, there is the good old “EC2” service, which offers compute instances of many sizes and shapes, including ones with GPUs.

As you can see, Amazon covers the whole spectrum from specialized to flexible. The slogan is:

Machine learning in the hands of every developer and data scientist.

This article gives a brief overview of the Amazon SageMaker service and highlights several things you should consider when deciding whether or not to use it. Please be aware that the author’s experience with SageMaker is limited to Deep Learning for image and video analysis. This kind of Machine Learning involves huge datasets and memory-intensive training, so some of the concerns might be completely irrelevant for other applications.

SageMaker

Build, train, and deploy machine learning models at scale.

AWS SageMaker was designed with a focus on seamless adoption by the machine learning community and easy deployment of trained models to production. It offers Python and Jupyter Notebook — everything we normally use to play with Machine Learning. And no special configuration is required to start training a model remotely in the cloud.

First you spin up a so-called “notebook instance” which will host the Jupyter Notebook application itself, all the notebooks, auxiliary scripts and other files. There is no need to connect to that instance (actually you cannot, even if you wanted to) or set it up in any way. Everything is already prepared for you to jump into some exciting Machine Learning! Create a new notebook and use it to collect and prepare some data, to define a model, and to start the learning process. All from one place, using Python, popular ML frameworks and SageMaker’s own libraries.

Speaking of frameworks and libraries, SageMaker supports TensorFlow and Apache MXNet out-of-the-box. It also comes with some built-in algorithms, for instance, PCA, K-Means and XGBoost. You can even use Apache Spark to pre-process the data. If this is not enough, if you want to use your own favorite framework or maybe even another programming language, well, you can. It won’t be as simple, but you can package any machine learning algorithm into a Docker container and plug it into SageMaker’s training-serving pipeline.
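As a rough illustration only, here is a minimal sketch of what the training entrypoint inside such a container might look like, assuming the /opt/ml directory layout described in the SageMaker documentation; the file names and the “training” channel are placeholders, not a real algorithm:

```python
#!/usr/bin/env python
# train -- hypothetical entrypoint of a bring-your-own-algorithm container.
# SageMaker mounts input data and configuration under /opt/ml and expects
# the resulting model artifacts to be written to /opt/ml/model.
import json
import os

PREFIX = '/opt/ml'
TRAINING_DIR = os.path.join(PREFIX, 'input/data/training')
CONFIG_PATH = os.path.join(PREFIX, 'input/config/hyperparameters.json')
MODEL_DIR = os.path.join(PREFIX, 'model')


def main():
    with open(CONFIG_PATH) as f:
        hyperparameters = json.load(f)

    # ... run your own training code over the files in TRAINING_DIR ...
    model = {'note': 'placeholder instead of real training',
             'hyperparameters': hyperparameters}

    # Whatever is written to /opt/ml/model is packaged and uploaded back to S3.
    with open(os.path.join(MODEL_DIR, 'model.json'), 'w') as f:
        json.dump(model, f)


if __name__ == '__main__':
    main()
```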

The overall SageMaker workflow is the following. You prepare the data and the model in a Jupyter notebook, then configure and launch training using the SageMaker SDK. On your command SageMaker spins up one or several “training instances”, uploads all the necessary scripts and data there and runs the training. Once the training is done, you can deploy your model to a production-ready cluster and serve predictions via an HTTP API. SageMaker will even scale the cluster automatically within the specified limits.
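To make this concrete, here is a minimal sketch of launching a training job with the SageMaker Python SDK; the entry point, bucket and instance type are hypothetical, and the exact argument names may differ between SDK versions:

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # IAM role attached to the notebook instance

# 'model.py' is a hypothetical script defining the model (see the next section).
estimator = TensorFlow(entry_point='model.py',
                       role=role,
                       training_steps=10000,
                       evaluation_steps=100,
                       train_instance_count=1,
                       train_instance_type='ml.p2.xlarge')

# A single call provisions the training instances, copies the data from S3
# to each of them and runs the training.
estimator.fit('s3://my-sagemaker-datasets/datasets/images')
```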

Generally, deployment of a model to production is a tricky task, but SageMaker takes care of it. Using either the Python SDK or the web interface, you define an HTTP endpoint for your model and the rest just happens. Quite impressive. And you also get logging and monitoring of the cluster for free! Well, not literally for free…
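Continuing the sketch above, deployment is roughly a single call as well; the instance type and the sample input below are placeholders:

```python
import sagemaker

# Deploy the trained model behind an HTTP endpoint and get a predictor back.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')

# Send a prediction request; the payload format depends on how the model
# defines its serving inputs.
sample_input = [0.5] * 784  # placeholder input vector
result = predictor.predict(sample_input)

# Tear the endpoint down when it is no longer needed, to stop paying for it.
sagemaker.Session().delete_endpoint(predictor.endpoint)
```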

What to consider

As mentioned before, training in SageMaker workflow is launched right from a Jupyter notebook. All configuration, provisioning of compute instances, moving of data, etc. is triggered literally with a single function call. This nifty process dictates a certain approach to defining the models and organizing the data. We will come back to the data later.

If, for instance, your framework of choice is TensorFlow, you should adopt its high-level Estimator API to define the model and how data is fed to it. Basically, SageMaker expects from you a Python script containing one function that defines the model, and three functions that feed the data in different modes: training, testing (evaluation) and predicting (serving). The rest — the training loop, saving checkpoints, etc. — is handled behind the scenes. These functions, their parameters and expected return values are well documented here. Several examples can also be found in the primer notebooks on any notebook instance.
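A rough skeleton of such a script, assuming the documented TensorFlow Estimator contract; the single dense layer and the .npz file names are placeholders, not a real model:

```python
# model.py -- skeleton of the script the SageMaker TensorFlow container expects.
import os

import numpy as np
import tensorflow as tf

INPUT_TENSOR_NAME = 'inputs'


def model_fn(features, labels, mode, hyperparameters):
    # Define the network and return an EstimatorSpec; a single dense layer here.
    logits = tf.layers.dense(features[INPUT_TENSOR_NAME], units=10)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode, predictions={'classes': tf.argmax(logits, axis=1)})
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)


def train_input_fn(training_dir, hyperparameters):
    # Feed training data; SageMaker has already copied it from S3 to training_dir.
    return _input_fn(training_dir, 'train.npz')


def eval_input_fn(training_dir, hyperparameters):
    # Feed evaluation data.
    return _input_fn(training_dir, 'eval.npz')


def serving_input_fn(hyperparameters):
    # Describe the inputs the deployed endpoint will accept.
    inputs = {INPUT_TENSOR_NAME: tf.placeholder(tf.float32, [None, 784])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)


def _input_fn(training_dir, filename):
    data = np.load(os.path.join(training_dir, filename))
    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: data['x'].astype(np.float32)},
        y=data['y'].astype(np.int32),
        shuffle=True)()
```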

Now, what is the matter with the data? Obviously, SageMaker is built on top of other AWS services. Notebook, training and deployment machines are just ordinary EC2 instances running specific Amazon Machine Images (AMI). And the data (as well as results, checkpoints, logs, etc.) is stored in S3 object storage. This might cause trouble if you are working with images, videos or any other huge datasets. The fact is that you have to upload all your data to S3. When you configure the training, you tell SageMaker where to find your data. SageMaker then automatically downloads the data from S3 to every training instance before starting the training. Every time. For reference, it takes around 20 minutes to download 100 GB worth of images, which means that you have to wait at least 25 minutes before the training begins. Good luck debugging your model! On the other hand, when all the preliminary trials are done elsewhere and your model is already polished, the training experience is very smooth. Simply upload your data to S3 and get the interim results from there as well.
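Uploading the data itself is a one-liner with the SageMaker SDK; the local path and bucket name here are hypothetical:

```python
import sagemaker

# Upload a local directory with images to S3 once; the returned S3 URI is what
# you later pass to estimator.fit(). The copy from S3 to the training instances
# still happens again on every training run.
session = sagemaker.Session()
inputs = session.upload_data(path='data/images',
                             bucket='my-sagemaker-datasets',
                             key_prefix='datasets/images')
print(inputs)  # s3://my-sagemaker-datasets/datasets/images
```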

Another aspect to consider is pricing. Notebook instances can be very cheap, especially when there is no need to pre-process the data. Training instances, on the other hand, may easily burn a hole in your pocket. The smallest GPU instance, with one NVIDIA K80 GPU, 12 GB of video memory and 61 GB of RAM, costs $1.361 per hour in Ireland at the moment. Simple math shows that one week of training costs around $230. Prices are a bit lower in the US regions, where you will be paying $1.26 per hour for the same type of instance. Check all the prices here, as well as the list of regions where SageMaker has already been launched.
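The back-of-the-envelope math behind that estimate, using the Ireland price quoted above:

```python
# One GPU training instance in eu-west-1 (Ireland), running around the clock
hourly_rate = 1.361                  # USD per hour, price at the time of writing
weekly_cost = hourly_rate * 24 * 7   # 228.65 USD, i.e. roughly $230 per week
print(round(weekly_cost, 2))
```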

To sum it up, Amazon SageMaker offers:

  • familiar environment (Jupyter Notebook, Python);
  • popular ML frameworks (Apache MXNet, TensorFlow);
  • option to bring any other frameworks and libraries;
  • “zero-configuration” workflow for training;
  • out-of-the-box support for multi-node training;
  • straightforward deployment of trained models to production.

On the other hand, SageMaker is:

  • opinionated;
  • expensive.

Amazon SageMaker is a very interesting service worth giving a try. In my case, though, the fact that the data has to be stored in S3 and then copied to a training instance every time became a deal-breaker. Not being able to test and debug my models locally, I would have to wait a long time for feedback from every trial.

Tips

If you have decided to give SageMaker a shot, there are a few things to keep in mind.

SageMaker is a relatively new service in the AWS family and it is not yet available in all regions. Check the pricing page for the list of available regions.

The next one is curious. S3 object storage presents a global namespace that is not bound to a particular region, yet every bucket is still created in a specific region, which you pick at creation time. When you create a bucket for your data, make sure to pick the same region where you plan to run your SageMaker instances. Otherwise SageMaker will not be able to access the data.
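With boto3 the region can be pinned explicitly when creating the bucket; the bucket name below is hypothetical and “eu-west-1” stands for whatever region you use for SageMaker:

```python
import boto3

# Create the data bucket in the same region where the SageMaker instances will run.
s3 = boto3.client('s3', region_name='eu-west-1')
s3.create_bucket(Bucket='my-sagemaker-datasets',
                 CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})
```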

Amazon does a decent job of protecting its clients from spending their money by mistake. Part of this effort is the system of resource limits. SageMaker will not allow you to accidentally spin up some crazy number of expensive instances and waste your whole budget in one hour. For cheap instances the limit is fairly high, say 5, 10 or 20 instances. For particularly expensive ones the default limit is set to zero, meaning you are not allowed to run such instances at all. This is the case for GPU instances. So, if you plan to train on a GPU, make sure to request a limit increase in advance. The process is straightforward, but the request goes through an actual human being in support, which might take some time.

An alternative

If SageMaker does not fit your needs, consider an alternative — EC2 Spot Instances. Of course, training ML models on Spot Instances requires some extra work, but it is very flexible and might save you up to 70% of the costs. In the next article I explain how to automate the related routine tasks. Stay tuned!

References

  1. SageMaker documentation
  2. SageMaker pricing
  3. Types of Machine Learning instances
