3,302 reads

Optimizing neural networks for production with Intel’s OpenVINO

by Oleksandr SavsunenkoNovember 27th, 2018

Too Long; Didn't Read

I work at <a href="https://skylum.com" target="_blank">Skylum</a> — the company making leading AI-enabled photo editing software <a href="https://skylum.com/luminar" target="_blank">Luminar</a>, <a href="https://skylum.com/aurorahdr" target="_blank">Aurora HDR</a> and <a href="https://photolemur.com" target="_blank">Photolemur</a>. Currently, our systems use Tensorflow as a neural computation engine. Delivering optimized and small neural networks for our customers is not an easy process. There are a few things you have to keep in mind — the size of the tensorflow build itself, size of neural models and their computation speed. TF isn’t perfect for that. The size of the native TensorFlow inference engine after all optimizations is at least 60 megabytes and optimizing models of edge CPU computations isn’t perfect as well. TFLite is a depreciated solution and TF Mobile is good at what it’s doing — optimizing for mobile CPU’s. The area of pure desktop optimization isn’t covered by any major library, so it’s a part of my interest.

Companies Mentioned

featured image - Optimizing neural networks for production with Intel’s OpenVINO

In the holy war between various neural network design frameworks one important step is usually missing — making production-ready and optimized deliverables. I’ve tested Intel’s OpenVINO optimizer system and it looks really promising.

Intro

I work at Skylum — the company making leading AI-enabled photo editing software Luminar, Aurora HDR and Photolemur. Currently, our systems use Tensorflow as a neural computation engine. Delivering optimized and small neural networks for our customers is not an easy process. There are a few things you have to keep in mind — the size of the tensorflow build itself, size of neural models and their computation speed. TF isn’t perfect for that. The size of the native TensorFlow inference engine after all optimizations is at least 60 megabytes and optimizing models of edge CPU computations isn’t perfect as well. TFLite is a depreciated solution and TF Mobile is good at what it’s doing — optimizing for mobile CPU’s. The area of pure desktop optimization isn’t covered by any major library, so it’s a part of my interest.

As a part of an ongoing collaborative effort with Intel, I am actively looking to switching our inference to something that is native for CPUs and readily adopted for Intel’s build-it GPUs, that majority of Intel chipsets currently has. Here I am reporting my test results of their OpenVINO optimization package.

Technical details

I chose the task of semantic segmentation as a very representative problem for our software. For example, it powers our AI Sky Enhancer filter, as well as a range of upcoming effects. Obviously, I can’t report results of our actual neural network architecture, so I’ve chosen the one that resembles is close enough — DeepLab V3+ with MobileNetV2 head with enabled ASPP layers. For a bit broader scope I chose two versions of this network — with output stride 8 and 16, delivering different output resolutions.

To generate those I’ve used official Google’s repository, trained the models and exported the frozen computation graphs using officially offered export script, naming input and output layers as “input” and “segmap” respectfully. The resulting files were tested for TF performance using a simple python script that loaded and evaluated frozen .pb graph. Resolution was 513x513px and CPU used is Intel(R) Core(TM) i7–8700K CPU operating at 3.70GHz

Testing the performance of the same model using OpenVINO wasn’t straightforward and I am really grateful for help coming from Intel here. First of all, I installed OpenVino at my test system, running Ubuntu 16.04 following the official instruction. Initial attempt of running optimization script was a failure. Here’s what I got.

python mo.py — input_model /data/1.pb — input_shape “(1,513,513,3)” — log_level=DEBUG — data_type FP32 — output segmap — input input — scale 1 — model_name test — framework tf — output_dir ./

[ ERROR ] Stopped shape/value propagation at "GreaterEqual" node.tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node GreaterEqual was passed int64 from add_1_port_0_ie_placeholder:0 incompatible with expected int32.

The problem here was a range of layers that OpenVINO supports. When DeepLab exports the model it actually includes a range of pre- and postprocessing operations (resizing, normalization, etc) to make use of the model as easy as possible. It is done using built-in TensorFlow operations, that are sometimes under optimal and poorly written. For example, their resize function and buggy and gave me a lot of headaches recently. So, I asked for help at Intel’s forum and one of their employees made a GitHub repo + explanation.

Long story short — you have to cut of preprocessing and post-processing operations from TensorFlow graph and optimize only actual neural network operations. That actually makes a lot of sense as those operations are much faster to implement using C++ for example.

Inference speed comparison

So, after doing that I’ve made a head-to-head comparison of a few version of TF graph inferenced using TF engine and OpenVINO engine. What’s actually proposed in Fiona’s GitHub repo is to actually cut TF graph into 3 pieces — preprocessing, inference and postprocessing. Pre- and post-processing is to be done with TF and inference is outsourced to OpenVINO.

When I went straight ahead for direct comparison using the script provided I found no difference in execution time. A gut feeling suggested that it can’t be true, so I measured all three stages separately. The idea was that starting TF engine twice is a time-hungry operation and taking into account fact of non-ideal optimization of various TF operators there could be an overhead. That happened to be true.

time cost preprocess: 0.0021 sectime cost to inference : 0.1979 sectime cost postprocess: 0.2642sec

The postprocessing with Tensorflow actually took more time than inference itself!

To handle this situation I made a fair use comparison. I’ve cut off the pre- and post-processing stages implemented them in OpenCV and Numpy and run a comparison of the same stack of operations with TF and OpenVINO.

Important notice is the level of optimizations. My Tensorflow build on this machine is an official TF 1.11 from Google’s repo, build without AVX2 and FMA optimizations. OpenVINO currently offers two levels of optimization built into their distribution — AVX level and SSE level. For our production, we are mostly interested in SSE in terms of backward compatibility, but I am reporting here both versions.

Inference speed comparison between TensorFlow and OpenVINO on a DeepLabV3+ / MobileNetV2 / ASPP head network.

As you can see OpenVINO offers a very significant boost of speed in this fair head-to-head comparison in range of 30–50%, presumably exploiting their knowledge of Intel architectures.

Further research

I was still interested to see the performance in other use cases and maybe try something more “pure” like a PB model with no pre- and post-processing and maybe more conventional architecture, like ResNet/DenseNet or even VGG. Hopefully, this gives some insights into the capabilities of OpenVINO.

I would be very interested in testing inference using Intel’s built-in GPUs, but OpenCL availability for a different platform is an issue here, so I’ll postpone this until there’s a better perspective for consistent Multiplatform production-ready solutions. Also, the very important topic of FT16 and INT8 inference is not covered here. I wasn’t able to run the same experiment in half precision and found no mentions of INT8 quantization optimizations in OpenVINO.

And, it would be cool to compare OpenVINO to TensorRT. Your claps and follows will motivate me for further research, stay tuned!