Lossy Compression of Neural Networks in TensorFlow Graph Files (This post motivates and describes applications of findings from Han, Mao, and Dally’s paper on neural net compression on TensorFlow Graph Files. Huge shout out to Michael Shumikhin , Anish Athalye , and Shai Szulanski — who helped motivate and advance this work.) Neural nets are big. Sometimes it’s useful when they’re smaller (so you can store more, or deploy them with less bandwidth). sets forth ways to do theoretical compression of NNs. You can use off the shelf compression algorithms and to achieve 3x size reductions in TF graphfiles, without having to retrain the graph and with no (noticed) significant loss in accuracy. TL;DR This paper our library Artificial neural networks have been gaining popularity over the past few years… … is an understatement. Machine is possibly most hyped field in computer science right now, and neural nets are used for almost all things ‘machine learning’: sentiment analysis, decision making, image classification, etc. Neural networks work by taking in a ‘training’ set of data (input and output pairs), and learning to predict the output based on the input; once trained, a neural network can predict the outputs for a ‘test’ set (data that it hasn’t previously seen). One example of an input and output pair might be the pixels of an image, and the output could be a value 0 or 1 indicating whether that image contains a dog. Another example might be an input of pixels from an image and the output is a 3-dimensional vector, where the first dimension is 0 or 1 to indicate the presence of a dog, the second dimension indicates the presence of a cat, and the third dimension indicates the presence of a bird. Neural networks in general have been shown to be be incredibly robust and are currently deployed in production at many top companies including /Facebook/Amazon/Etc. learning the Google The essence of neural networks is a sequence of weighted linear regressions, combined with transformations on the outputs of regressions (these transformations range from the identity function, to tanh, to ). The neural network can ‘learn’ these weights through an algorithm called back propagation, which relies on gradient descent to constantly tweak the weights to improve accuracy over the training set of data. ReLUs Note: This is an oversimplification, and CNNs and RNNs can complicate this — but it will work for demonstrative purposes. If you’re interested in learning more more, I highly recommend reading the intro from Stanford’s CS231n . Remembering back to 8th grade algebra, linear regressions take in a value, and produce a value by a coefficient and adding a bias. This is the classic equation. Where is the weight, and is the bias. In multiple dimensions, this works much the same way, the only difference is that and are vectors. y=mx+b m b x m In a neural network, where one is doing many, many high dimensional regressions —the values for weights and biases can take up a lot of memory. Using , a popular machine learning library open-sourced by Google, as an example: a trained deep mnist graph (which classifies , and is used as the ) takes up . The (used for ) takes up ( , WARNING: large download). As we continue to iterate on these nets and store multiple versions, the size becomes problematic. It would be nice to have a smaller representation to store multiple versions, send over the network to deploy on servers (or self driving cars…), and store on mobile devices. TensorFlow handwritten digits getting started demo for TensorFlow 12MiB Google Inception V3 Graph image classification 91MiB found here Crash Course in TensorFlow Graphs TensorFlow structures computation in graphs. These graphs store operations (ops) and the relationship between the inputs of an operation and outputs from previous operations. These relationships are stored as ‘tensors’, which are basically n-dimensional arrays whose values are not necessarily known and whose shape may be partially known. You can run a TensorFlow graph by feeding values into the graph for some Tensors and running them through operations until you’ve reached the desired output. For example, if I want to model the equation in TensorFlow, I can do it with the following Python: y=mx+b import tensorflow as tf y = tf.add(tf.mul(tf.Variable(2, name="m"),tf.placeholder(tf.int32, shape=1, name="x")),tf.Variable(1, name="b")) The example TensorFlow graph for y=mx+b. As you can see the ops are: “Add”, “Mul”, “m”, “b”, and “x”. Tensors are named by , so for example has one output (the result of ), and it would be named . Variables and placeholders (placeholders are Variables whose value must be fed on any graph run) are also operations with one output tensor (their own value). In this example, the value of , , , are fairly self evident, the operation takes in and as inputs to produce , and takes in and to produce . You may notice there is no in this graph, that is because is effectively , although if we really wanted to you could also create a variable and assign it to (although that seems like overkill). To run this graph, one might feed in a value for and ask for , using a command like . OPERATION:OUTPUT_INDEX Mul m*x Mul:0 m:0 x:0 b:0 Mul m:0 x:0 Mul:0 Add Mul:0 b:0 Add:0 y y Add:0 y Add:0 x:0 Add:0 sess.run(["Add:0"], feed_dict={"x:0": (5,)}) These graphs are encoded and stored in a protobuf. Values for variables are not stored in a graph def (as they can be assigned over multiple executions) and are therefore not part of the graph; however, variables can be saved in a checkpoint file (which is basically a dictionary containing variable name and value… at least according to Google, the file type is much more opaque than GraphDefs). Variables are useful when we’re fitting a regression (or training neural net), but it doesn’t make much sense to keep them as variables once we know we will no longer be changing them. Instead we can convert them to constant values (either using the function or the tool distributed with TensorFlow) and simplify our graph. An example GraphDef after freezing our toy model contains a collection of nodes and (possibly) the value of the tensor they contain: GraphDef convert_variables_to_constants freeze_graph mx+b node {name: "m"op: "Const"attr {key: "dtype"value {type: DT_INT32}}attr {key: "value"value {tensor {dtype: DT_INT32tensor_shape {}int_val: 2}}}} ... node {name: "Add"op: "Add"input: "Mul"input: "b/read"attr {key: "T"value {type: DT_INT32}}} Note: Tensors with values that have more than one scalar are stored as packed bytes in a string . They can be transformed to/from arrays with the tf.contrib.util.make_ndarray and .make_tensor_proto functions . To read more on the variety of TensorFlow model files, . see here Compression https://github.com/tomasreimers/tensorflow-graph-compression Wanting to compress TensorFlow Graph files, the first thing we did was try gzip, a popular compression tool. Gzip was not super effective. It did not significantly reduce the file size of the saved deep mnist graph (rounds to before and after compression) and reduced the file size of the inception graph from to (even using compression). Facing similar results with other lossless compression algorithms, I decided to consider lossy neural net compression. 12MiB 91MiB 85MiB gzip -9 TensorFlow graph files are largely weights. Stripping the tensors containing the various weights from the GraphDef files, the deep mnist graph is only and the inception graph is only . Knowing that, I began to look for ways to lossy compress the weights. 3.7KiB 120KiB Neural net weight imprecision has started to emerge as an area of research for a variety of reasons: hardware manufacturers are interested in it because it might allow them to make more ( ), and software engineers are interested in it because it might allow them to and possibly even . imprecise hardware that can run neural nets faster NVidia’s TensorRT has hardware level support for INT8 and Float16 modes to run nets faster reduce the space taken up by neural nets software level speedups or energy efficiency The seminal research in the field of weight compression was conducted by . This research assumed that one still had access to the training set of the neural net; however, for our purposes we’re going to assume that we don’t have access to that dataset (for example, if we have a pre-trained net from a model zoo). This means that we can not do pruning (which built on ), as that would require being able to train the network; however we still can do quantization. Han, Mao, and Dally and showed that neural networks could be reduced in size with pruning, quantization, and Huffman coding Han’s previous work The insight behind quantization is that weights in a neural network are fundamentally imprecise by the very nature of the back-propagation algorithm, meaning that they don’t necessarily need all the precision we allocate for them. Currently, weights (assuming they’re standard 32b floats) can take on any one of 2³² different values, meaning that their representation must have a minimum of 32 bits. However, if we only let the weights take on 2⁸ values (8 is arbitrary, we can pick any integer < 32), we would only need 8 bits to represent them (you could imagine replacing each weight with an integer 0–255, and having a codebook of negligible size to translate the integer to a floating point value for that weight). The dynamic programming solution to find optimal clustering in 1D. So now the problem becomes, how can we select 2⁸ values to appropriately represent all of our weights? Well this becomes a clustering problem, and Hans et al. did this using K-Means clustering (which finds a local optima) on the collection of scalar values from all of the weights, and replacing each scalar with its cluster centroid. We implemented this algorithm, and realized we could actually improve it (shout out to Shai), because the collection of the scalars in all the weights exists in one-dimensional space, and we can use dynamic programming to find the global optima for clustering in 1D in O(kN²) time and O(kN) space, where k is the number of clusters and N is the number of points being clustered. We implemented optimal clustering , although it will likely need to be reimplemented in C++ if we want it to be fast enough to use on real nets. here Note: We have been minimizing the distance between each scalar and the value we replace it with, using a predetermined number of clusters; however if you have more knowledge of your net and want to use another loss function, you should check out this paper which proposes an O(N²) for clustering without a predetermined number of clusters. Having replaced all the weight scalars with the centroid of their cluster, we should have the exact same file size because the cluster centroids still have the same amount of bits as the values they replaced. In order to see a size reduction we would need to implement a codebook and replace each weight with a lower bit representation indexing into the codebook. If we replace 32 bit weights with a 8 bit index into the codebook, this should achieve a . In addition, Han et al. point out that we could use to achieve even greater compression ratios (assuming the weights are not evenly distributed). 4x reduction Huffman coding Evidence that weights are not evenly distributed in nets, and that Huffman encoding may be useful. The images above depict 98% of the data, I’ve removed the 1st and 99th percentile as they are outliers and extend the range by so much it renders the graphs difficult to read. However, implementing a codebook with Huffman coding is painful (or at least time-consuming) and there is no reason we have to do it ourselves (Shout out to Michael for pointing this out): by replacing the weights with their centroids, we’ve increased redundancy in the file; this means that we can use off the shelf compression algorithms and they will create a codebook with Huffman encoding for us (but possibly less with a worse compression ratio than if we did the codebook manually). Using GZip on the new files, we compressed mnist from (using a global codebook) and inception from (using a per layer codebook). We also found to be exceptional and further reduce mnist to and inception to while these are less than the 4x we should see in theory, it is still a higher compression ratio than we originally had and we’re not dependent on a custom encoding. 12MiB to 5MiB 91MiB to 37MiB Facebook’s zstd 4MiB 34MiB; The nets with new weights did not show significant increases in error. Mnist with weight replacement achieved an accuracy of (compared to before weight replacement… the improvement in accuracy is likely due to random chance). An interesting area of further research would be to try and encode the weights with even fewer bits (imagine a 2⁶ entry codebook, where each weight can be encoded in 6 bits) and graph the size/accuracy tradeoff. 0.9644 0.9642 I’ve open sourced the weight compression script, and it is available here . Improvements and pull requests welcome :) Conclusions Neural nets are incredibly powerful and being deployed in more and more places. Currently, the representation of neural nets is space intensive, and this might be a problem as we begin to export neural nets to more and more platforms (both mobile and web, consider projects like which is incredibly exciting and pushing the frontiers of what’s possible, but whose requires the user to download ~100MiB of weights to run inception in the browser). We demonstrate that we can cluster weights to encode them to reduce file sizes. Additionally, we show there is no need to implement custom decoders because by simply replacing the weights with some similar value we can increase redundancy in the representation of the net and thus increase the compression ratios we can expect with off-the-shelf compression algorithms ( ); all without too much loss of accuracy. Keras.js demos added win: gzip is implemented in most servers and browsers by default ;) P.s. Throughout this project, Apple Finder’s notion of base-10 MB vs Terminal’s notion of base-2 MiB could be wildly confusing at times -_- Relevant XKCD . You can now also do something similar using TensorFlow’s . Update: quantize_weights graph transform

Amazon

Apple

Facebook

Google

NVIDIA

Scalar

Super

Too Long; Didn't Read

In your car, at home, or at work — Bosch technology shapes many areas of life.

When Smaller’s Better

Too Long; Didn't Read

Companies Mentioned

Tomas Reimers

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

When Smaller’s Better

Too Long; Didn't Read

Companies Mentioned

Tomas Reimers

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES