Opening the Black Box …. Somewhat “Information: the negative reciprocal value of probability.” — Claude Shannon Aim of this blog is not to understand the underlying mathematical concepts behind Neural but to visualise Neural Networks in terms of information manipulation. Network Encoder-Decoder Before we start: Encoder-Decoder is not two CNNs/RNNs combined together! Neither have to be neural network in fact! Originally, a concept of information theory. Encoder is simply compresses the information and decoder expands the encoded information. In case of machine , both encoding and decoding are both lose-full processes i.e. some information is always lost. learning The encoded output of the encoder is called the and this is the input for the decoder. context vector There are two ways to setup an encoder-decoder setting: The Decoder in inverse function of Encoder. This way the decoder tries to reproduce the original information. This is used to de-noise the data. This setting has a special name, called auto-encoder. The Encoder is a compression algorithm and decoder is generating algorithm. This helps to transfer context from one format to another. Example Applications: Auto-Encoder: Encoder compressing English text into a vector. Decoder generating original English text from the vector. Encoder-Decoder: Encoder compressing English text into a vector. Decoder generating French translation of the original text from the vector. Encoder-Decoder: Encoder compressing English text into a vector. Decoder generating image from the content of text. Information Theory Now, if I say every neural network, itself, is an encoder-decoder setting; it would sound absurd to most. Let’s re-imagine the neural networks. Let input layer be X and their real tags/classes (present in the training set) be Y. Now we already know Neural Networks find the underlying function between X and Y. So X can be seen as High-Entropy distribution of Y. High entropy because X contains the information Y but it also a lot of other information. Example: “This boy is good.” contains enough information to tell us about it’s ‘positive’ sentiment. But. It also contains things like: 1. It’s a specific Boy 2. It’s just one Boy 3. Tense of sentence is present Now no-entropy version of this sentence would be ‘positive’. Yes that’s also the output. We’ll come back to that in a while. Now imagine every hidden layer as a single variable H ( so layers will be named as H0, H1 ….. H(n-1) ) Now every layer becomes a variable and the neural network becomes a . Markov Chain because each variable is only dependent upon the previous layer only. Markov Chain So essentially each layer forms a partisan of information. Following is a visualisation of a Neural Network as a Markov Chain. The last layer Y_ is supposed to result in least entropy output (with respect to ‘Y’, the original tag/class). This process of obtaining Y_ is squeezing the information in X layer as it flows through H layers and retaining only the information most relevant to Y. This is the information bottleneck. Mutual Information I(X,Y) = H(X) — H(X|Y) H -> Entropy H(X) -> Entropy of X H(X|Y) -> Conditional Entropy of X given Y In other words, H(X|Y) signifies how much uncertainty is removed form X if Y is known. Properties of Mutual Information When you move along the Markov Chain the mutual information only decreases Mutual Information is invariant to reparameterization i.e. shuffling values in a layer won’t change the output Revisiting Bottleneck In the Markov Representation of Neural Network, every layer becomes a partition of information. In Information Theory these partitions are known as Successive Refinement of Relevant Information. You don’t have to worry about the details. Another way of seeing this is the input being encoded and decoded into the output. So, for enough hidden layers: the Sample Complexity of deep neural network is determined by encoded Mutual Information at Last Hidden Layer the Accuracy is determined by Decoded Mutual Information of Last Hidden Layer Sample Complexity is the number and variety of examples one needs to receive certain accuracy. Mutual Information along the Training Phase We calculate Mutual Information between Layer and Input Layer and Output Initial Conditions Initially,weights are randomly initialised. So barely anything is known about the correct output. With successive layers the mutual information about input decreases and the information in hidden layers about the output is low as well. Compression Phase As we train the Neural Network the plots start moving up, signifying gain of information about the output. But. Plots also start shifting towards the right side, signifying increase of information in latter layers about the input. This his the longest phase. Here the density of the plots is maximum and plots are concentrated at the top right. This signifies compression of information about the input in relation to the output. This is called the compression phase. Expansion Phase After the compression phase, the plots starts shifting towards the top but also to left side. This signifies, with successive layers, there is loss of information about the input and what’s retained in the last layer is the about the output. lowest entropy information Virtualisation The Markov Chain version of Neural Network highlights one more point, learning happens from layer to layer. A layer has all the information it needs to predict the output (plus some noise). So we use each layer to predict the output. This helps us peep into the layer-wise knowledge of the so called black box. This gives us a perspective about how many layers are required to make an accurate enough prediction of output. If saturation is achieved at earlier layers, the layers succeeding this layer can be pruned/dropped. These layers are usually very of hundreds or thousands of dimensions. Our evolution doesn’t allow us to visualise anything beyond the 3-dimensions. So we use dimension reduction techniques. There are various methods of performing dimension reduction. Cristopher Olah has a brilliant explaining those methods. I won’t go into the details of t-SNE, you can check this for details. blog blog To keep it short, t-SNE tries to reduce dimension retaining the neighbours from higher dimensions in lower dimensions. So this results in pretty accurate 2D and 3D plots. Following are the layer plots of a language model with 2 layers. About the plots: Selected 16 words Used the final language model to find N synonyms of the above 16 words (N=200 for 2D and N=50 for 3D) Find the representation vector for each word at each layer Find 2D and 3D reduced representation of each of above selected words and their synonyms using t-SNE Plot the reduced representations 2D plots of Layer 1 and Layer 2 3D plots of Layer 1 and Layer 2 Summary Most every deep neural network acts like an encoder-decoder Most of the training time is spent in compression phase Layers are learned bottom-up Beyond compression phase, more the neural network forgets about the input, the more accurate it gets (wiping out the irrelevant part of input). I have excluded mathematics from this blog. If you are comfortable enough with the mathematics of Information Theory, Game Theory, Learning Theory etc then do watch this of the Mastero . video Naftali Tishby Thank You \(-_- )/

Information Theory of Neural Networks

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Effective Parallel Computing

It Is Okay If You Don't Know What You Like. We Do (feat. Deep Recommendation Algorithms)

10 Machine Learning, Data Science, and Deep Learning Courses for Programmers in 2020

10 Computer Vision Startups on Product Hunt with the Most Upvotes

10 Best Entry Level Machine Learning Tutorials

10 Best + Free Machine Learning Courses Collection

Effective Parallel Computing

It Is Okay If You Don't Know What You Like. We Do (feat. Deep Recommendation Algorithms)

10 Machine Learning, Data Science, and Deep Learning Courses for Programmers in 2020

10 Computer Vision Startups on Product Hunt with the Most Upvotes

10 Best Entry Level Machine Learning Tutorials

10 Best + Free Machine Learning Courses Collection

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps