Note of thanks: This benchmark comparison wouldn't have been possible without the help of Tuatini GODARD, a great friend and an active freelancer. If you'd like to know more about him, you can read his complete interview here.
A big thank you to Laurae for many valuable pointers towards improving this post.
The latest version of the fastai course (2019) just launched; you'll definitely want to check it out: course.fast.ai
Note: This is not a sponsored post from fastai; I've taken the course and have learned a lot from it. Personally, I'd highly recommend it if you're just getting started with Deep Learning.
This is a quick walkthrough of what FP16 operations are and a quick explanation of how mixed precision training works, followed by a few benchmarks (well, mostly because I wanted to brag to my friend that my rig is faster than his, and partly for research purposes).
Note: This isn’t a performance benchmark, this is a comparison of training time on 2 builds based on 2080Ti and 1080Ti respectively.
More details later.
Deep Learning is a bunch of matrix operations handled by your GPU. These generally happen in FP32, or 32-bit floating point.
With the recent architectures and CUDA releases, FP16, or 16-bit floating point computation, has become easy. Since you're working with tensors of half the size, this effectively lets you crunch through more examples by increasing your batch_size, or use less GPU RAM than FP32 training (also known as Full Precision Training) at the same batch size.
In plain English, you can replace (batch_size) with (batch_size)*2 in your code.
The tensor cores are much faster at FP16 computation, which means you get a speed boost and use less GPU RAM as well!
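To make the memory claim concrete, here's a quick PyTorch sketch (the tensor shape is arbitrary, purely for illustration):

```python
import torch

t32 = torch.randn(1024, 1024)   # FP32: 4 bytes per element
t16 = t32.half()                # FP16: 2 bytes per element, half the memory

print(t32.element_size() * t32.nelement())  # 4194304 bytes (~4 MB)
print(t16.element_size() * t16.nelement())  # 2097152 bytes (~2 MB)
```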
Issues involved with Half Precision (the name comes from 16-bit floating point variables having half the precision of their 32-bit counterparts): small values such as gradients can underflow to zero, tiny weight updates can get rounded away, and large loss values can overflow, all due to the obvious loss of precision.
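To see these failure modes concretely, here's a minimal PyTorch sketch (the exact values are only illustrative):

```python
import torch

# FP16 has ~10 bits of mantissa, so small increments to a value vanish:
w = torch.tensor(1.0, dtype=torch.float16)
print(w + 1e-4)    # tensor(1., dtype=torch.float16) -- the update is lost

# Values below ~6e-8 underflow to zero entirely:
grad = torch.tensor(1e-8, dtype=torch.float16)
print(grad)        # tensor(0., dtype=torch.float16)

# And values above ~65504 overflow to infinity:
loss = torch.tensor(70000.0, dtype=torch.float16)
print(loss)        # tensor(inf, dtype=torch.float16)
```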
Enter, Mixed Precision
To avoid the above-mentioned issues, we do operations in FP16 and switch to FP32 wherever we suspect a loss of precision. Hence, Mixed Precision.
Step 1: Use FP16 wherever possible, for faster compute:
The input tensors are converted to FP16 tensors to allow for faster processing.
Step 2: Use FP32 to compute loss (To avoid under/overflow):
The tensors are converted back to FP32 to compute loss values in order to avoid under/overflow.
Step 3: Use an FP32 master copy of the weights for updates:
The FP32 tensors are used to update the weights, which are then converted back to FP16 for the forward and backward passes.
Step 4: Scale the loss to avoid gradient underflow:
The loss is multiplied by a scaling factor before the backward pass, and the resulting gradients are divided by the same factor before the weight update, so that small gradient values don't vanish in FP16. (A sketch putting all four steps together follows below.)
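Putting the four steps together, here's a minimal hand-rolled sketch of the recipe in PyTorch (the toy model, data, and static loss scale of 512 are illustrative assumptions; in practice a library like fastai or Apex, both covered below, does this bookkeeping for you):

```python
import torch

# Minimal sketch of the 4-step recipe; assumes a CUDA GPU.
model = torch.nn.Linear(10, 1).cuda().half()       # Step 1: FP16 model for fast compute
master_params = [p.detach().clone().float() for p in model.parameters()]
for mp in master_params:                           # FP32 "master" copy of the weights
    mp.requires_grad_(True)
optimizer = torch.optim.SGD(master_params, lr=1e-3)
loss_scale = 512.0                                 # Step 4: static loss-scaling factor

x = torch.randn(32, 10).cuda().half()              # toy data, FP16 inputs
y = torch.randn(32, 1).cuda().half()

for step in range(10):
    out = model(x)                                 # forward pass in FP16
    loss = torch.nn.functional.mse_loss(out.float(), y.float())  # Step 2: FP32 loss
    (loss * loss_scale).backward()                 # scaled backward pass (Step 4)
    for mp, p in zip(master_params, model.parameters()):
        mp.grad = p.grad.float() / loss_scale      # unscale grads onto the FP32 copy
        p.grad = None
    optimizer.step()                               # Step 3: update FP32 master weights
    with torch.no_grad():                          # copy updated weights back to FP16
        for mp, p in zip(master_params, model.parameters()):
            p.copy_(mp.half())
```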
As one may expect from fastai, doing mixed precision training is as easy as changing:

learn = Learner(data, model, metrics=[accuracy])

to:

learn = Learner(data, model, metrics=[accuracy]).to_fp16()
You can read the exact details of what happens when you do that here.
The module changes the forward and backward passes of training to use FP16, allowing a speedup.
Internally, the callback ensures that all model parameters (except batchnorm layers, which require fp32) are converted to fp16, and an fp32 copy is also saved. The fp32 copy (the master parameters) is what is used for actually updating with the optimizer; the fp16 parameters are used for calculating gradients. This helps avoid underflow with small learning rates.
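Putting it together, a hypothetical end-to-end example with fastai v1 (the MNIST_SAMPLE dataset and cnn_learner here are just a convenient stand-in; any Learner works the same way):

```python
from fastai.vision import *

# Small illustrative dataset; any DataBunch/model combination works the same way
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)

# .to_fp16() attaches the mixed precision callback described above
learn = cnn_learner(data, models.resnet18, metrics=[accuracy]).to_fp16()
learn.fit_one_cycle(1)
```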
Our hardware configurations vary slightly, so do take the values with a grain of salt.
Since the process is neither very RAM intensive nor CPU intensive, we chose to share our results here.
Below are graphs of training times for the respective ResNets.
Note: lower is better (the X-axis represents time in seconds, along with scaled time).
[Figure: training times for the smallest ResNet of all]
To allow experimentation with Mixed Precision and FP16 training, NVIDIA has released Apex, a set of NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
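For reference, here's roughly what using Apex's amp API looks like (a minimal sketch; the toy model and the "O1" opt_level are illustrative, see the Apex docs for the full set of options):

```python
import torch
from apex import amp

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# "O1" patches common ops to run in FP16 while keeping FP32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x, y = torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda()
loss = torch.nn.functional.mse_loss(model(x), y)

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()        # amp handles loss scaling internally
optimizer.step()
```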
The Apex repo also features a few examples that we could run directly without much tweaking; this seemed to be another good test for a quick spin.
Language Modelling comparison:
The example in the GitHub repo trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modelling task. By default, the training script uses the provided WikiText-2 dataset. The trained model can then be used by the generate script to produce new text.
We weren’t concerned with the generation of test-our comparisons are based on training the example for 30 epochs on Mixed Precision, Full Precision for the same batch sizes on the different setups.
Enabling FP16 is as easy as passing a "--fp16" argument when running the code, and Apex works on top of the PyTorch environment that we had already set up. Hence this seemed to be a perfect choice.
Below are the results from the same:
Although the RTX cards are much more powerful than the 1080Ti performance-wise, for smaller networks especially, the difference in training time isn't as pronounced as I had expected.
If you decide to try Mixed Precision training, one bonus tip: during testing, we were not able to run the code until we had updated our environments, so make sure your drivers and libraries are up to date.
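On that note, a quick way to check whether your GPU even has tensor cores (they require compute capability 7.0 or higher, i.e. Volta/Turing cards like the 2080Ti; the Pascal-based 1080Ti reports 6.1):

```python
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
print("Tensor cores available:", (major, minor) >= (7, 0))
```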
If you have any questions, please leave a note or comment below.
If you found this interesting and have any suggestions for interviewing someone, you can find me on Twitter here.
If you’re interested in reading about Deep Learning and Computer Vision news, you can check out my newsletter here.