“Information: the logarithm of the reciprocal of probability.” — Claude Shannon
The aim of this blog is not to explain the underlying mathematics of neural networks but to visualise neural networks in terms of information manipulation.
Before we start:
An encoder-decoder is not just two CNNs/RNNs glued together! In fact, neither part has to be a neural network at all!
Originally it is a concept from information theory: the encoder simply compresses the information, and the decoder expands the encoded information back out.
In machine learning, both encoding and decoding are lossy processes, i.e. some information is always lost.
The encoder's output is called the context vector, and it is the input to the decoder.
There are two ways to set up an encoder-decoder: reconstruct the same data (an auto-encoder) or produce different data (a general encoder-decoder).
Example Applications:
Auto-Encoder: the encoder compresses English text into a vector; the decoder regenerates the original English text from that vector (sketched in code below).
Encoder-Decoder: the encoder compresses English text into a vector; the decoder generates the French translation of the original text from that vector.
Encoder-Decoder: the encoder compresses English text into a vector; the decoder generates an image from the content of the text.
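To make the compress-then-expand idea concrete, here is a minimal auto-encoder sketch in PyTorch. The framework choice, the layer sizes and the 784-dimensional input are my own illustrative assumptions, not something fixed by the setting above; the point is only that the encoder produces a small context vector and the decoder tries to rebuild the input from it.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, context_dim=32):
        super().__init__()
        # Encoder: compresses the input into a small context vector.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, context_dim),
        )
        # Decoder: expands the context vector back towards the input.
        self.decoder = nn.Sequential(
            nn.Linear(context_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        context = self.encoder(x)               # lossy compression
        reconstruction = self.decoder(context)  # expansion from the context vector
        return reconstruction, context

model = AutoEncoder()
x = torch.randn(8, 784)                      # a dummy batch of inputs
x_hat, context = model(x)
loss = nn.functional.mse_loss(x_hat, x)      # never zero in practice: encoding is lossy
```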
Now, if I say that every neural network is, by itself, an encoder-decoder setup, it would sound absurd to most.
Let's re-imagine neural networks.
Let the input layer be X and the true tags/classes (present in the training set) be Y. We already know that neural networks find the underlying function between X and Y.
So X can be seen as a high-entropy representation of Y: high entropy because X contains the information about Y, but also a lot of other information.
Example:
“This boy is good.” contains enough information to tell us about its ‘positive’ sentiment.
But.
It also contains things like:
1. It’s a specific boy
2. It’s just one boy
3. The tense of the sentence is present
Now, the zero-entropy version of this sentence (with respect to sentiment) would simply be ‘positive’. Yes, that is also the output. We’ll come back to that in a while.
Now imagine every hidden layer as a single variable H (so the layers are named H0, H1, …, H(n-1)).
Every layer then becomes a variable, and the neural network becomes a Markov chain, because each variable depends only on the previous layer.
So, essentially, each layer forms a partition of the information.
Following is a visualisation of a Neural Network as a Markov Chain.
The last layer, Y_, is supposed to produce the lowest-entropy output with respect to Y, the original tag/class.
The process of obtaining Y_ squeezes the information in the X layer as it flows through the H layers, retaining only the information most relevant to Y. This is the information bottleneck.
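In symbols, the chain looks like the sketch below (with \hat{Y} standing for the Y_ above, and I(·;·) the mutual information defined just after). The second line is the data processing inequality, a standard information-theoretic fact rather than anything specific to this post: information about Y can only be lost, never created, as we move along the chain.

```latex
X \rightarrow H_0 \rightarrow H_1 \rightarrow \cdots \rightarrow H_{n-1} \rightarrow \hat{Y}

I(X; Y) \;\ge\; I(H_0; Y) \;\ge\; I(H_1; Y) \;\ge\; \cdots \;\ge\; I(\hat{Y}; Y)
```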
I(X; Y) = H(X) - H(X|Y)
H -> Entropy
H(X) -> Entropy of X
H(X|Y) -> Conditional Entropy of X given Y
In other words, H(X|Y) signifies how much uncertainty about X remains once Y is known; the difference H(X) - H(X|Y) is the uncertainty removed, i.e. the mutual information.
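To see these quantities in action, here is a tiny worked example in Python; the joint distribution over two binary variables is made up purely for illustration, and the code just applies the formula above.

```python
import numpy as np

# Toy joint distribution p(x, y) over two binary variables.
# Rows index X, columns index Y; entries sum to 1.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x = entropy(p_x)                  # H(X)
H_xy = entropy(p_xy.flatten())      # joint entropy H(X, Y)
H_x_given_y = H_xy - entropy(p_y)   # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_xy = H_x - H_x_given_y            # I(X; Y) = H(X) - H(X|Y)

print(f"H(X)   = {H_x:.3f} bits")
print(f"H(X|Y) = {H_x_given_y:.3f} bits")
print(f"I(X;Y) = {I_xy:.3f} bits")
```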
In the Markov representation of a neural network, every layer becomes a partition of the information.
In information theory, these partitions are known as Successive Refinement of Relevant Information. You don’t have to worry about the details.
Another way of seeing this is that the input is being encoded and then decoded into the output.
So, for enough hidden layers:
Sample complexity is the number and variety of examples one needs in order to reach a certain accuracy.
We calculate the mutual information between the input and each hidden layer, I(X; H), and between each hidden layer and the output, I(H; Y), and track how both change during training.
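Here is a rough sketch of how such per-layer estimates can be computed by discretising the activations. The bin count and the shortcut that I(X; H) collapses to H(H) (because each layer is a deterministic function of its input) are assumptions in the spirit of binning-based information-plane experiments; the exact numbers depend heavily on these choices.

```python
import numpy as np

def discretize(activations, n_bins=30):
    """Bin each hidden unit's activation into equal-width bins, then treat
    every distinct binned vector as one discrete state."""
    bins = np.linspace(activations.min(), activations.max(), n_bins + 1)
    digitized = np.digitize(activations, bins)
    _, states = np.unique(digitized, axis=0, return_inverse=True)
    return states

def entropy_of(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def layer_information(hidden, y, n_bins=30):
    """Binning-based estimates of I(X; H) and I(H; Y) for one layer."""
    t = discretize(hidden, n_bins)
    i_xh = entropy_of(t)                 # I(X; H) ≈ H(T) for a deterministic layer
    h_t_given_y = 0.0                    # H(T|Y) averaged over classes
    for cls in np.unique(y):
        mask = (y == cls)
        h_t_given_y += mask.mean() * entropy_of(t[mask])
    i_hy = i_xh - h_t_given_y            # I(H; Y) = H(T) - H(T|Y)
    return i_xh, i_hy

# Stand-in data: a 10-unit layer over 1000 samples with 3 classes.
hidden = np.random.rand(1000, 10)
y = np.random.randint(0, 3, size=1000)
print(layer_information(hidden, y))
```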
Initial Conditions
Initially, the weights are randomly initialised, so the network knows barely anything about the correct output. With successive layers, the mutual information with the input decreases, and the information the hidden layers carry about the output is low as well.
Compression Phase
As we train the neural network, the points in the plot start moving up, signifying a gain of information about the output.
But.
The points also start shifting towards the right, signifying an increase in the information the later layers carry about the input.
This is the longest phase. Here the density of the points is at its maximum and they are concentrated at the top right. This signifies compression of the information about the input relative to the output.
This is called the compression phase.
Expansion Phase
After the compression phase, the points start shifting towards the top, but also towards the left side.
This signifies that, with successive layers, information about the input is lost, and what is retained in the last layer is the lowest-entropy information about the output.
The Markov-chain view of a neural network highlights one more point: learning happens layer by layer. Each layer has all the information it needs to predict the output (plus some noise).
So we can use each layer to predict the output. This lets us peep into the layer-wise knowledge of the so-called black box.
This gives us a perspective on how many layers are required to make an accurate enough prediction of the output. If saturation is reached at an earlier layer, the layers after it can be pruned/dropped.
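As a sketch of this layer-wise peeping: fit a simple classifier on every layer's activations and watch where its accuracy saturates. The linear probe, the train/test split and the stand-in data are my own illustrative choices; the idea itself does not prescribe a specific probe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def layer_wise_probe(layer_activations, y):
    """Fit a linear probe on each layer's activations and return its
    held-out accuracy. Layers after the score saturates are pruning candidates."""
    scores = []
    for acts in layer_activations:  # each: (n_samples, n_units)
        X_tr, X_te, y_tr, y_te = train_test_split(acts, y, test_size=0.2, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(probe.score(X_te, y_te))
    return scores

# Stand-in data: activations of two hidden layers for 500 samples, 3 classes.
layers = [np.random.randn(500, 64), np.random.randn(500, 32)]
y = np.random.randint(0, 3, size=500)
print(layer_wise_probe(layers, y))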
These layers usually have hundreds or thousands of dimensions. Our evolution doesn’t allow us to visualise anything beyond three dimensions, so we use dimensionality-reduction techniques.
There are various methods of performing dimensionality reduction. Christopher Olah has a brilliant blog explaining those methods. I won’t go into the details of t-SNE; you can check this blog for details.
To keep it short, t-SNE reduces the dimensionality while trying to keep each point’s higher-dimensional neighbours as its neighbours in the lower-dimensional space, which results in fairly faithful 2D and 3D plots.
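A minimal t-SNE sketch with scikit-learn and matplotlib; the array shapes and the random stand-in data are mine, and in practice you would feed in one hidden layer's real activations together with the true tags/classes.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in data: in practice `activations` would be one hidden layer's
# outputs over the dataset and `labels` the true tags/classes.
activations = np.random.randn(500, 128)
labels = np.random.randint(0, 4, size=500)

# t-SNE tries to keep points that are neighbours in 128-D close together in 2-D.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(activations)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE projection of one hidden layer")
plt.show()
```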
Following are the layer plots of a language model with 2 layers.
About the plots:
2D plots of Layer 1 and Layer 2
3D plots of Layer 1 and Layer 2
I have excluded the mathematics from this blog. If you are comfortable with the mathematics of Information Theory, Game Theory, Learning Theory, etc., then do watch this video of the maestro Naftali Tishby.
Thank You \(-_- )/