## Lossy Compression of Neural Networks in TensorFlow Graph Files

*(This post motivates and describes applications of findings from **Han, Mao, and Dally’s paper on neural net compression** on TensorFlow Graph Files. Huge shout out to **Michael Shumikhin**, **Anish Athalye**, and **Shai Szulanski** — who helped motivate and advance this work.)*

**TL;DR **Neural nets are big. Sometimes it’s useful when they’re smaller (so you can store more, or deploy them with less bandwidth). This paper sets forth ways to do theoretical compression of NNs. You can use off the shelf compression algorithms and our library to achieve 3x size reductions in TF graphfiles, without having to retrain the graph and with no (noticed) significant loss in accuracy.

Artificial neural networks have been gaining popularity over the past few years…

… is an understatement. Machine learning is possibly *the* most hyped field in computer science right now, and neural nets are used for almost all things ‘machine learning’: sentiment analysis, decision making, image classification, etc. Neural networks work by taking in a ‘training’ set of data (input and output pairs), and learning to predict the output based on the input; once trained, a neural network can predict the outputs for a ‘test’ set (data that it hasn’t previously seen). One example of an input and output pair might be the pixels of an image, and the output could be a value 0 or 1 indicating whether that image contains a dog. Another example might be an input of pixels from an image and the output is a 3-dimensional vector, where the first dimension is 0 or 1 to indicate the presence of a dog, the second dimension indicates the presence of a cat, and the third dimension indicates the presence of a bird. Neural networks in general have been shown to be be incredibly robust and are currently deployed in production at many top companies including Google/Facebook/Amazon/Etc.

The essence of neural networks is a sequence of weighted linear regressions, combined with transformations on the outputs of regressions (these transformations range from the identity function, to tanh, to ReLUs). The neural network can ‘learn’ these weights through an algorithm called back propagation, which relies on gradient descent to constantly tweak the weights to improve accuracy over the training set of data.

*Note: This is an oversimplification, and **CNNs** and **RNNs** can complicate this — but it will work for demonstrative purposes. If you’re interested in learning more more, I highly recommend reading the intro from **Stanford’s CS231n**.*

Remembering back to 8th grade algebra, linear regressions take in a value, and produce a value by a coefficient and adding a bias. This is the classic `y=mx+b`

equation. Where `m`

is the weight, and `b`

is the bias. In multiple dimensions, this works much the same way, the only difference is that `x`

and `m`

are vectors.

In a neural network, where one is doing many, many high dimensional regressions —the values for weights and biases can take up a lot of memory. Using TensorFlow, a popular machine learning library open-sourced by Google, as an example: a trained deep mnist graph (which classifies handwritten digits, and is used as the getting started demo for TensorFlow) takes up **12MiB**. The Google Inception V3 Graph (used for image classification) takes up **91MiB **(found here, WARNING: large download). As we continue to iterate on these nets and store multiple versions, the size becomes problematic. It would be nice to have a smaller representation to store multiple versions, send over the network to deploy on servers (or self driving cars…), and store on mobile devices.

#### Crash Course in TensorFlow Graphs

TensorFlow structures computation in graphs. These graphs store operations (ops) and the relationship between the inputs of an operation and outputs from previous operations.

These relationships are stored as ‘tensors’, which are basically n-dimensional arrays whose values are not necessarily known and whose shape may be partially known. You can run a TensorFlow graph by feeding values into the graph for some Tensors and running them through operations until you’ve reached the desired output. For example, if I want to model the equation `y=mx+b`

in TensorFlow, I can do it with the following Python:

import tensorflow as tf

y = tf.add(

tf.mul(

tf.Variable(2, name="m"),

tf.placeholder(tf.int32, shape=1, name="x")

),

tf.Variable(1, name="b")

)

As you can see the ops are: “Add”, “Mul”, “m”, “b”, and “x”. Tensors are named by `OPERATION:OUTPUT_INDEX`

, so for example `Mul`

has one output (the result of `m*x`

), and it would be named `Mul:0`

. Variables and placeholders (placeholders are Variables whose value must be fed on any graph run) are also operations with one output tensor (their own value). In this example, the value of `m:0`

, `x:0`

, `b:0`

, are fairly self evident, the `Mul`

operation takes in `m:0`

and `x:0`

as inputs to produce `Mul:0`

, and `Add`

takes in `Mul:0`

and `b:0`

to produce `Add:0`

. You may notice there is no `y`

in this graph, that is because `y`

is effectively `Add:0`

, although if we really wanted to you could also create a `y`

variable and assign it to `Add:0`

(although that seems like overkill). To run this graph, one might feed in a value for `x:0`

and ask for `Add:0`

, using a command like `sess.run(["Add:0"], feed_dict={"x:0": (5,)})`

.

These graphs are encoded and stored in a GraphDef protobuf. Values for variables are not stored in a graph def (as they can be assigned over multiple executions) and are therefore not part of the graph; however, variables can be saved in a checkpoint file (which is basically a dictionary containing variable name and value… at least according to Google, the file type is much more opaque than GraphDefs). Variables are useful when we’re fitting a regression (or training neural net), but it doesn’t make much sense to keep them as variables once we know we will no longer be changing them. Instead we can convert them to constant values (either using the convert_variables_to_constants function or the freeze_graph tool distributed with TensorFlow) and simplify our graph. An example GraphDef after freezing our toy `mx+b`

model contains a collection of nodes and (possibly) the value of the tensor they contain:

node {

name: "m"

op: "Const"

attr {

key: "dtype"

value {

type: DT_INT32

}

}

attr {

key: "value"

value {

tensor {

dtype: DT_INT32

tensor_shape {

}

int_val: 2

}

}

}

}

...

node {

name: "Add"

op: "Add"

input: "Mul"

input: "b/read"

attr {

key: "T"

value {

type: DT_INT32

}

}

}

*Note: Tensors with values that have more than one scalar are stored as **packed bytes in a string**. They can be transformed to/from arrays with the **tf.contrib.util.make_ndarray and .make_tensor_proto functions**.*

To read more on the variety of TensorFlow model files, see here.

#### Compression

https://github.com/tomasreimers/tensorflow-graph-compression

Wanting to compress TensorFlow Graph files, the first thing we did was try gzip, a popular compression tool. Gzip was not super effective. It did not significantly reduce the file size of the saved deep mnist graph (rounds to **12MiB** before and after compression)** **and reduced the file size of the inception graph from **91MiB** to **85MiB **(even using `gzip -9`

compression). Facing similar results with other lossless compression algorithms, I decided to consider lossy neural net compression.

TensorFlow graph files are largely weights. Stripping the tensors containing the various weights from the GraphDef files, the deep mnist graph is only **3.7KiB** and the inception graph is only **120KiB**. Knowing that, I began to look for ways to lossy compress the weights.

Neural net weight imprecision has started to emerge as an area of research for a variety of reasons: hardware manufacturers are interested in it because it might allow them to make more imprecise hardware that can run neural nets faster (NVidia’s TensorRT has hardware level support for INT8 and Float16 modes to run nets faster), and software engineers are interested in it because it might allow them to reduce the space taken up by neural nets and possibly even software level speedups or energy efficiency.

The seminal research in the field of weight compression was conducted by Han, Mao, and Dally and showed that neural networks could be reduced in size with pruning, quantization, and Huffman coding. This research assumed that one still had access to the training set of the neural net; however, for our purposes we’re going to assume that we don’t have access to that dataset (for example, if we have a pre-trained net from a model zoo). This means that we can not do pruning (which built on Han’s previous work), as that would require being able to train the network; however we still can do quantization.

The insight behind quantization is that weights in a neural network are fundamentally imprecise by the very nature of the back-propagation algorithm, meaning that they don’t necessarily need all the precision we allocate for them. Currently, weights (assuming they’re standard 32b floats) can take on any one of 2³² different values, meaning that their representation must have a minimum of 32 bits. However, if we only let the weights take on 2⁸ values (8 is arbitrary, we can pick any integer < 32), we would only need 8 bits to represent them (you could imagine replacing each weight with an integer 0–255, and having a codebook of negligible size to translate the integer to a floating point value for that weight).

So now the problem becomes, how can we select 2⁸ values to appropriately represent all of our weights? Well this becomes a clustering problem, and Hans et al. did this using K-Means clustering (which finds a local optima) on the collection of scalar values from all of the weights, and replacing each scalar with its cluster centroid. We implemented this algorithm, and realized we could actually improve it (shout out to Shai), because the collection of the scalars in all the weights exists in one-dimensional space, and we can use dynamic programming to find the global optima for clustering in 1D in O(kN²) time and O(kN) space, where k is the number of clusters and N is the number of points being clustered. We implemented optimal clustering here, although it will likely need to be reimplemented in C++ if we want it to be fast enough to use on real nets.

*Note: We have been minimizing the distance between each scalar and the value we replace it with, using a predetermined number of clusters; however if you have more knowledge of your net and want to use another loss function, you should check out **this paper** which proposes an O(N²) for clustering without a predetermined number of clusters.*

Having replaced all the weight scalars with the centroid of their cluster, we should have the exact same file size because the cluster centroids still have the same amount of bits as the values they replaced. In order to see a size reduction we would need to implement a codebook and replace each weight with a lower bit representation indexing into the codebook. If we replace 32 bit weights with a 8 bit index into the codebook, this should achieve a **4x reduction**. In addition, Han et al. point out that we could use Huffman coding to achieve even greater compression ratios (assuming the weights are not evenly distributed).

However, implementing a codebook with Huffman coding is painful (or at least time-consuming) and there is no reason we have to do it ourselves (Shout out to Michael for pointing this out): by replacing the weights with their centroids, we’ve increased redundancy in the file; this means that we can use off the shelf compression algorithms and they will create a codebook with Huffman encoding for us (but possibly less with a worse compression ratio than if we did the codebook manually).

Using GZip on the new files, we compressed mnist from **12MiB to 5MiB **(using a global codebook) and inception from **91MiB to 37MiB **(using a per layer codebook). We also found Facebook’s zstd to be exceptional and further reduce mnist to **4MiB** and inception to **34MiB;** while these are less than the 4x we should see in theory, it is still a higher compression ratio than we originally had and we’re not dependent on a custom encoding.

The nets with new weights did not show significant increases in error. Mnist with weight replacement achieved an accuracy of **0.9644**(compared to **0.9642** before weight replacement… the improvement in accuracy is likely due to random chance). An interesting area of further research would be to try and encode the weights with even fewer bits (imagine a 2⁶ entry codebook, where each weight can be encoded in 6 bits) and graph the size/accuracy tradeoff.

*I’ve open sourced the weight compression script, and it is available **here**. Improvements and pull requests welcome :)*

#### Conclusions

Neural nets are incredibly powerful and being deployed in more and more places. Currently, the representation of neural nets is space intensive, and this might be a problem as we begin to export neural nets to more and more platforms (both mobile and web, consider projects like Keras.js which is incredibly exciting and pushing the frontiers of what’s possible, but whose demos requires the user to download ~100MiB of weights to run inception in the browser). We demonstrate that we can cluster weights to encode them to reduce file sizes. Additionally, we show there is no need to implement custom decoders because by simply replacing the weights with some similar value we can increase redundancy in the representation of the net and thus increase the compression ratios we can expect with off-the-shelf compression algorithms (*added win: gzip is implemented in most servers and browsers by default ;)*); all without too much loss of accuracy.

*P.s. Throughout this project, Apple Finder’s notion of **base-10 MB** vs Terminal’s notion of **base-2 MiB **could be wildly confusing at times -_- **Relevant XKCD**.*

**Update: **You can now also do something similar using TensorFlow’s quantize_weights graph transform.