Uncovering the Intuition behind Capsule Networks and Inverse Graphics

Written by tanaykothari | Published 2017/11/20
Tech Story Tags: machine-learning | capsule-networks | artificial-intelligence | graphics | towards-data-science


Does the Brain do Inverse Graphics? — Geoffrey Hinton

1. Introduction

‘Capsule Networks’ and ‘Inverse Graphics’ seem like intimidating and somewhat vague terms when heard for the first time. These terms weren’t prevalent in mainstream media until recently, after the godfather of deep learning, Geoffrey Hinton, came out with two papers on Dynamic Routing between Capsules and on Matrix Capsules with EM Routing [This is currently a blind submission under review for ICLR 2018 but let’s be honest, we know it’s going to be Hinton et al.].

In this article, I will try to distill these ideas, explain the intuition behind them, and show how they bring machine learning models in computer vision one step closer to emulating human vision. Starting with the intuition behind CNNs, I'll dive into how they arose from our hypotheses about the neuroscience of human sight, explain why inverse graphics is a promising way to build the next generation of computer vision systems, and finally give a brief overview of how all of this connects to Capsule Networks.

2. The first breakthrough idea: Hierarchy

Research into the neuroscience of human sight led us to realize that humans learn and analyze visual information hierarchically. Babies first learn to recognize boundaries and colors. They use this information to recognize more complex entities like shapes and figures. Slowly they learn to go from circles to eyes and mouths to entire faces.

When we look at an image of a person, our brain recognizes two eyes, one nose, and one mouth: it recognizes all the entities that are present in a face, and we think, "This looks like a person."

This was the initial intuition for the origin of deep neural networks when they were first architected in the 1970s. These networks were architected to recognize low-level features and build complex entities from them one layer at a time.

3. The second breakthrough idea: Positional Equivariance

Figure 1: Recognizing a cat should be the same learning process regardless of its position

Invariance vs. Equivariance

What does this mean? If we have translational invariance (which is true for the CNNs we use right now), then both of these images will be predicted as cats. That is, the position of the cat in the image does not (and should not) affect what we classify the image as.

The concept of equivariance is similar to invariance, except that, in addition to keeping the classification independent of position, we also want to predict where the object is: i.e., in addition to detecting that it is a cat, we want the network to be able to tell whether it is a cat on the left side or a cat on the right side.

Let’s say we have this hierarchical network that can detect cats. You wouldn’t want to have one set of nodes try to learn to recognize a cat at one specific area in the image and another set trying to learn the same cat but somewhere else.

Convolutions

Enter Convolutional Neural Networks (you can read more here if this is new to you)! These have small kernels that analyze local regions of an image to recognize features. The revolutionary idea was to use the same kernel all over the image to detect occurrences of the same feature in multiple locations. This made the systems perform better and run faster, thanks to the reduction in parameters that comes from sharing them across all locations in the image.
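To make the shared-kernel idea and the invariance/equivariance distinction concrete, here is a minimal numpy sketch. The toy horizontal-difference filter stands in for a learned kernel, and the circular shift is just a convenient stand-in for translation; none of this comes from the capsule papers.

```python
import numpy as np

def shift_right(image, dx):
    """Toy translation: circularly shift an image dx pixels to the right."""
    return np.roll(image, dx, axis=1)

def detect_edges(image):
    """Apply the *same* tiny horizontal-difference kernel at every location
    (this is the weight sharing of a convolutional layer)."""
    return image - np.roll(image, 1, axis=1)

def classify(image):
    """Collapse the feature map to one number: 'was an edge seen anywhere?'"""
    return detect_edges(image).max()

img = np.zeros((8, 8))
img[3, 2] = 1.0  # a single bright pixel

# Equivariance: shifting the input shifts the feature map by the same amount.
assert np.allclose(detect_edges(shift_right(img, 3)),
                   shift_right(detect_edges(img), 3))

# Invariance: the pooled "classification" is unchanged, but *where* the feature was is lost.
assert np.isclose(classify(shift_right(img, 3)), classify(img))
```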

In 2012, Hinton, with Ilya Sutskever and Alex Krizhevsky created AlexNet: a deep convolutional neural network, which performed phenomenally on ImageNet.

CNNs soon became synonymous with Computer Vision and were applied to all major tasks: from Object Detection and Image Classification to Segmentation, Generative Models, and much more.

4. The First Fall: MaxPool did not work for modern-day problems

“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.” — Hinton

Figure 2: Max pooling with a 2x2 kernel and stride 2

If MaxPool were a messenger between the two layers, what it would tell the second layer is: 'We saw a high 6 somewhere in the top-left corner and a high 8 somewhere in the top-right corner.'
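As a concrete illustration of what that 'message' looks like, here is a minimal numpy sketch of 2x2 max pooling with stride 2 (the numbers are made up to match the 6-and-8 example above):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest activation in each
    2x2 block and throw away exactly *where* in the block it occurred."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 8],
                 [6, 2, 1, 0],
                 [4, 4, 5, 5],
                 [0, 2, 7, 1]], dtype=float)

print(max_pool_2x2(fmap))
# [[6. 8.]
#  [4. 7.]]
# The 6 and the 8 survive, but their exact positions inside each block do not.
```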

Figure 3: A CNN with MaxPool would classify both images as human faces because it would detect all the required features. Because of MaxPool, it never learnt any spatial relation between these elements. [Source]

The first implementation of convolutional networks came in the 1980s from Kunihiko Fukushima, who architected a deep neural network called the Neocognitron with convolutional layers (which embodied translational equivariance), each followed by a pooling layer (to allow for translational invariance). Fukushima used pooling at the time and eloquently explained the intuition behind it in his paper. The initial idea of a pooling layer made sense back then, because the task being solved was recognizing handwritten digits. Yet we kept carrying this remnant of a distant past along with us, and it was about time we did something about it.

As humans, we don't just detect all the parts that make up a whole; we also need those parts to be spatially related to each other in the right way. MaxPool, however, slowly strips away this information in order to create translational invariance.

Sure, you can detect basic features in the first layer and attempt to send that to the next layer of nodes to detect more complex objects. But how do you decide how to transmit this information between these two sets of nodes?

And here lies the most important piece of the puzzle: Routing. What is Routing?

Routing: The strategy or protocol used to send information from nodes in one layer to nodes in the next layer in a hierarchical learning system (aka Deep Neural Network)

MaxPool seemed like a hack that solved the problem, performed well, and became the standard. But the ghosts of our past came back to haunt us. By representing each 2x2 block with a single number, it allowed a feature to be detected in slightly different places and still lead to the same output. What if the face was slightly off-center? The eyes could be a bit closer to the left edge, but such minute changes shouldn't affect our predictions. Right?

But MaxPool was far too lenient with how much translational invariance it allowed.

Figure 4: Messing around with the original image actually improved the confidence. Seems fishy. [Source]

Well, maybe it is a bit too lenient, but we do generalize to different orientations right?

Figure 5: Clearly our network isn’t generalizing to different orientations.

Not really. The real issue wasn’t that MaxPool wasn’t doing its task well: it was great at translational invariance.

MaxPooling (or subsampling) allows our models to be invariant to small changes in viewpoint.

Today, our tasks span many domains, and the majority of the time we are training our models on images of real-life 3D scenes. Translational invariance alone is not going to suffice anymore. What we are looking for now is 'Viewpoint Invariance'.

Figure 6: An excerpt from Hinton’s paper demonstrating the same object observed from different viewpoints

And here is where we dive into the basics of the math behind viewpoints: Inverse Graphics.

5. Distilling Viewpoints and Inverse Graphics

While producing a 2D image of a 3D scene from the viewpoint of a camera (similar to taking a picture; this is called rendering, and it is how computer games and movies like Avatar are created), the rendering engine needs to know where all objects are w.r.t. the camera (aka the viewpoint). However, we wouldn't want to define all objects relative to the camera. We'd rather define them in our own coordinate frame, and then render them from any camera viewpoint we want.

In addition, while creating these graphics, you might want to, for example, define the location of the eye relative to the face, but not necessarily relative to the entire person. Essentially, you would have a hierarchy of parts creating a whole object.

Pose matrices help us define camera viewpoints for all objects and also represent the relation between the parts and the whole.

Pose Matrix: The essence of graphics

A pose matrix (also called a transformation matrix) is a 4x4 matrix which represents the properties of an object in a coordinate frame. This matrix represents the 3-dimensional translation (x-y-z coordinates with respect to the origin), scale, and rotation.

Figure 7: A pose matrix. The T-values represent the translation, while the R-values represent the rotation and scale of the image.
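For readers who cannot see the figure, a standard 4x4 homogeneous transformation matrix of the kind Figure 7 describes looks roughly like this: the 3x3 R block carries rotation and scale, the T column carries translation, and multiplying it with a homogeneous coordinate applies the transformation. This is textbook graphics notation, not anything specific to the capsule papers.

```latex
P =
\begin{bmatrix}
R_{11} & R_{12} & R_{13} & T_x \\
R_{21} & R_{22} & R_{23} & T_y \\
R_{31} & R_{32} & R_{33} & T_z \\
0      & 0      & 0      & 1
\end{bmatrix},
\qquad
\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix}
=
P
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
```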

Those of you who have experience in 3D modeling or image editing know exactly what I'm talking about. These concepts have existed in standard computer graphics for decades, but had somehow escaped the grasp of machine learning.

Note: It is not essential to understand what the elements of the pose matrix mean. If you want, you can read more about the math behind pose matrices here.

Figure 8: Pose matrices representing hierarchical relationships. M represents where the face is on the person, while N represents where the mouth is on the face.

A whole is made of its parts, and each part is related to the main object via a pose matrix. The pose matrix represents the smaller part in the coordinates of the whole object. And here comes the most essential property of pose matrices: if M is the pose matrix of the face w.r.t. the person, and N is the pose matrix of the mouth w.r.t. the face, then we can get the coordinates of the mouth w.r.t. the person (i.e. its pose matrix w.r.t. the person) as N' = MN.

Aside: Think about a pose matrix as relative velocity. If A is 5 m/s faster than B, and B is 5 m/s faster than C, then we can say that A is 5 + 5 = 10 m/s faster than C. Just as we can add the two numbers to calculate the relative speed, we can multiply the two pose matrices to get the pose matrix of the mouth relative to the person.

Now if we have a camera, and we know that in the frame of the camera, the person’s pose matrix is P, we can extract the pose matrix and thus all essential properties of every part of the person by multiplying pose matrices. In the above example, the pose matrix for the face in the frame of the camera would be given by M’ = PM. This is how all rendering engines used for games and movies function under the hood.
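Here is a minimal numpy sketch of that chain of multiplications. The pose helper and all the numbers are invented for illustration; a real rendering engine would build these matrices from the scene description.

```python
import numpy as np

def pose(tx, ty, tz, yaw_deg=0.0):
    """Build a simple 4x4 pose matrix: a rotation about the z-axis plus a
    translation (real pose matrices can also encode scale)."""
    a = np.radians(yaw_deg)
    P = np.eye(4)
    P[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a),  np.cos(a), 0.0],
                 [0.0,        0.0,       1.0]]
    P[:3, 3] = [tx, ty, tz]
    return P

P = pose(0.0, 0.0, 5.0)        # person w.r.t. the camera: 5 units in front
M = pose(0.0, 1.6, 0.0)        # face w.r.t. the person: 1.6 units up
N = pose(0.0, -0.05, 0.1)      # mouth w.r.t. the face: slightly down and forward

face_in_camera  = P @ M        # M' = PM
mouth_in_person = M @ N        # N' = MN
mouth_in_camera = P @ M @ N    # chain as deep as the part hierarchy goes

print(mouth_in_camera[:3, 3])  # where the mouth sits in the camera's frame
```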

Figure 9: Yellow highlighted text is the pose of the objects w.r.t the camera viewpoint

This pose matrix represents the different viewpoints we can look at the object from. All features of a face stay the same; all that differs is the pose of the face from your viewpoint. The viewpoints of all other objects can be derived just from knowing P.

Figure 10: Example of a coordinate conversion from one frame to another, described by the 4x4 matrix. Here Xw is a coordinate in the person’s frame, and the 4x4 matrix is P. Xc is the same coordinate in the camera’s frame.

Inverse Graphics

Inverse graphics is going in the opposite direction to what we talked about above. Hinton believes that the brain works in this sort of way. Looking at a 2D image, it tries to estimate the viewpoint through which we are looking at a virtual 3D object.

Figure 11: When you look at these images, do you imagine a 3D chair like that in front of you? Can you rotate the image in your head and visualize how the chair would look from a top view? This is what Hinton was getting at with Inverse Graphics representing the human function of sight.

Now, we can combine hierarchical recognition and viewpoint invariance to dive into how this system actually works.

Estimating the Inverse Pose Matrix

Figure 12: Estimating the pose of the face from the pose of the left eye

Given a pose for the left eye, you can estimate the pose for the face (in other words, if I tell you where the left eye is, you can imagine where the rest of the face would be, right?). Similarly, we can estimate the pose of the face from the pose of the mouth. If you remember the images of Kim from earlier, in the normal, upright image the estimates of the face pose from the mouth and from the left eye are similar: we can confidently say that they belong to the same face and are thus related. Similarly, even in the upside-down image, both the upside-down mouth and the upside-down left eye hint that the face should be upside down. Thus we assign both features to the same whole.

Figure 13: If we can estimate Ev and Mv from the image (through ML models), we can multiply them with the inverse pose matrices (E and M) to get an approximate pose matrix for the face

Figure 14: Non-agreement on the position (pose) of the face. Left: The actual image. Right: Pose estimates for the face based on the mouth and the left eye.

For the distorted image of Kim above, the mouth hints at a face in the top corner while the left eye says the face should be at the bottom of the image. That doesn't seem to match. So we would have less agreement between these features, and they shouldn't be considered parts of the same whole. Because this agreement only happens when the parts are placed in the correct locations w.r.t. each other, this scheme leads the network to learn relative spatial positioning. In this case, it would learn that a mouth should be below and between the two eyes for it to be part of a face, instead of recognizing a face from the mere existence of a mouth and an eye.
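A minimal sketch of that agreement idea, in 2D to keep it readable: each detected part casts a 'vote' for the pose of the face by multiplying its own pose with a part-to-whole transform, and votes that land close together count as agreement. All transforms and coordinates below are invented for illustration; in a capsule network the part-to-whole transforms are learned weight matrices.

```python
import numpy as np

def pose2d(tx, ty, angle_deg=0.0):
    """Toy 2D pose: a rotation plus a translation as a 3x3 homogeneous matrix."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), tx],
                     [np.sin(a),  np.cos(a), ty],
                     [0.0,        0.0,       1.0]])

# Made-up part-to-whole relations: where the face centre sits relative to each part.
eye_to_face   = pose2d(+0.3, -0.5)   # face centre: right of and below the left eye
mouth_to_face = pose2d( 0.0, +0.6)   # face centre: above the mouth

def face_votes(eye_pose, mouth_pose):
    """Each detected part casts a vote for the pose of the whole face."""
    return eye_pose @ eye_to_face, mouth_pose @ mouth_to_face

def agreement(vote_a, vote_b):
    """Crude agreement score: how close are the two predicted face positions?"""
    return -np.linalg.norm(vote_a[:2, 2] - vote_b[:2, 2])

# Intact face: the eye and mouth are where a face would put them -> votes agree.
v1, v2 = face_votes(pose2d(1.0, 2.0), pose2d(1.3, 0.9))
print("intact face agreement:   ", agreement(v1, v2))   # ~0.0

# Distorted face: the mouth is pasted near the top corner -> votes disagree,
# so the two parts should not be routed to the same "face" capsule.
v1, v2 = face_votes(pose2d(1.0, 2.0), pose2d(0.2, 3.5))
print("distorted face agreement:", agreement(v1, v2))   # clearly negative
```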

Pixels might change drastically upon changing a viewpoint but the pose matrix changes linearly.

This allows us to model spatial relations using linear transformations, enabling us to generalize to multiple viewpoints and represent information hierarchically by design.

And this is what the two papers, Dynamic Routing Between Capsules and Matrix Capsules with EM Routing, explore.

Aside: There has been some work done earlier in using Inverse Graphics in machine learning: DC-IGN (Deep Convolutional Inverse Graphics Network) by Kulkarni et al.

6. A peek into Dynamic Routing with Capsules

While current networks have nodes that output scalar values (the activation for a feature), capsule networks replace them with capsules, which output an activation along with a vector (or matrix) that encapsulates further information about the feature: its position, rotation, scale, thickness of stroke, or anything else you can think of.
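In the Dynamic Routing paper, the activation is simply the length of the capsule's output vector, kept below 1 by a squashing non-linearity, while the direction of the vector encodes the feature's properties. A minimal sketch of that convention (the 8D numbers here are arbitrary):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squashing non-linearity from Sabour et al. (2017): keeps the vector's
    direction but maps its length into [0, 1) so it can act as a probability."""
    norm_sq = float(np.dot(s, s))
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

raw = np.array([0.4, -1.2, 0.1, 2.0, 0.0, -0.3, 0.7, 0.5])  # an 8D capsule pre-activation
v = squash(raw)

activation = np.linalg.norm(v)   # "how likely is this feature to be present?"
properties = v / activation      # unit direction: position, scale, rotation, stroke, ...
```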

I know you're thinking, 'Why don't we just use 8 convolutional layers instead of this 8D output from a capsule?' The next few sections should make this clearer.

Figure 15: The network finds the vertical and horizontal lines, and a hint of the diagonal line. It found all features of the 7 and some of the features of the 4. The text highlighted in yellow are the capsule outputs

In a CNN, both the horizontal and vertical lines have a high weight towards both numbers: the network realizes that they can be part of the 7 and of the 4. Even though this image clearly does not contain a 4, a CNN would give the 4 a decently high probability.

Now let’s see how a capsule network would work:

In a capsule network, each capsule outputs a vector carrying information about the position, scale, and rotation of the feature.

Step 1:

  • When the capsule for 4 receives its inputs: a vertical line, a misplaced horizontal line, and a barely detected diagonal line, the capsule for the 4 outputs a low activation.
  • When the capsule for 7 receives its inputs: an ideally-placed vertical line, an ideally-placed horizontal line, and a barely detected diagonal line, the capsule for the 7 outputs a high activation.

Figure 16: Pose estimates for the 4 from the poses of the features detected in the image. All estimates are different from each other.

Dynamic Routing

The power of a capsule network comes from Dynamic Routing. Over multiple epochs, our network learns to detect different features through its many nodes and develops a general idea of how they are related (for example, a vertical line can belong to a 4, a 7, a 1, or even a 9). What dynamic routing controls is whether the vertical line should send its information to the 4 or to the 7: which one is it a part of in this context? This is a calculation that happens at every iteration to route information from capsules in one layer to capsules in the next. Each capsule in layer L has a coupling strength c with each capsule in layer L+1, which represents the likelihood of that part belonging to that particular whole.

While normal forward propagation passes information with standard weights, z = W * a, with the coupling strength it becomes z = c * W * a (with c < 1).

Step 2:

  • Now, the features (initial feature capsules) look at the capsules for the next layer.
  • The vertical line looks at the 7 and the 4, and goes: "Whoa, the 4 barely has any activation, and its approximate position is different from what I predicted, but the 7 looks like it's exactly where I predicted it would be if the image were a 7, and because it has a high activation, the other features agree with me on that too!" So it increases its coupling strength towards the 7 and reduces it towards the 4. The same thing happens for the horizontal line.

  • Now that we have new coupling strengths, we recalculate our estimates for the pose and activation of 7 and 4. All capsules have a lower coupling strength to 4, so the activation for 4 would be even lower this time.

Hinton in his paper repeats Step 1 and Step 2 three times for each pair of capsule layers.

Through Dynamic Routing, lower-layer capsules get feedback from higher-layer capsules about what to pay attention to.

Note: If you’ve heard about the EM or Expectation-Maximization algorithm, can you see how Step 1 and Step 2 represent the M and E step? ;)
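To make Steps 1 and 2 concrete, here is a minimal numpy sketch of routing-by-agreement in the spirit of the Dynamic Routing paper. It assumes the prediction vectors u_hat (each lower capsule's guess at each upper capsule's output, obtained by multiplying the lower capsule's output with a learned weight matrix) have already been computed; the three iterations follow the paper, everything else is simplified.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Keep each vector's direction, map its length into [0, 1)."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between two capsule layers.

    u_hat: (num_lower, num_upper, dim) prediction vectors,
           u_hat[i, j] = W[i, j] @ u[i] (the W's are learned elsewhere).
    Returns the upper-layer capsule outputs, shape (num_upper, dim).
    """
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                        # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # coupling strengths (softmax of logits)
        s = (c[:, :, None] * u_hat).sum(axis=0)                 # z = c * W * a: weighted sum of predictions
        v = squash(s)                                           # Step 1: upper capsules form their outputs
        b += (u_hat * v[None, :, :]).sum(axis=-1)               # Step 2: dot-product agreement updates couplings
    return v

# e.g. 3 detected strokes voting for the "7" and the "4" capsules, with 16-D outputs
u_hat = np.random.randn(3, 2, 16)
v = dynamic_routing(u_hat)
print(np.linalg.norm(v, axis=-1))   # the vector lengths act as the activations
```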

In general, when training a capsule network, each capsule is not forced to record specifically the pose and translational properties; it is free to use its vector output to encapsulate whatever it wants. So how do we ensure that it uses the vector representation to capture properties of the object (like scale and position) as well?

Reconstruction

Here’s where we use the ability to reconstruct. Let’s look at the images of the cat again.

Figure 17: Two cats: what do you get from this image?

Suppose I asked you to look at both of these images and then draw them again. If all you took from the images was that they contained a cat, you might just draw a cat in the middle of the canvas for both. However, if you wanted your drawing to be close to the original, you would also try to remember where the cat was and how big it was, so that you could draw an accurate picture.

Hinton enforced the same principle with Reconstruction.

Figure 18: Training the capsule network. The Encoder is the capsule network while the decoder is a set of fully connected layers which take the output of the capsule network (aka the compressed representation)

The output of a capsule network is, for each capsule, an activation along with a vector. During training, in addition to the classification loss (a margin loss in the paper), Hinton also added a reconstruction loss: when the vector output of the correct class is passed through a decoder, how far off is the recreated image from the original?

This automatically forced the capsule layers to be able to learn and represent the positions and properties of the images in addition to just their activations.
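A rough sketch of how this combined objective might look in code. Here decoder stands in for the fully connected decoder of Figure 18, and the heavily down-weighted sum-of-squared-pixel-errors reconstruction term follows the Dynamic Routing paper; the exact weight below is quoted from memory, so treat it as an assumption.

```python
import numpy as np

def capsule_loss(class_loss, capsule_vectors, true_class, decoder, image,
                 recon_weight=0.0005):
    """Classification loss plus a down-weighted reconstruction loss.

    capsule_vectors: (num_classes, 16) output vectors of the final capsule layer.
    decoder: a function mapping the masked, flattened vectors to a flat image
             (stands in for the fully connected decoder of Figure 18).
    """
    # Mask out every class capsule except the correct one...
    masked = np.zeros_like(capsule_vectors)
    masked[true_class] = capsule_vectors[true_class]
    # ...decode it back to pixels and compare with the original image.
    reconstruction = decoder(masked.ravel())
    recon_loss = np.sum((image.ravel() - reconstruction) ** 2)
    # The small weight keeps reconstruction from dominating the classification loss.
    return class_loss + recon_weight * recon_loss
```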

Figure 19: Each of these rows represents perturbations along one dimension of the vector output, for different digits.

Wrapping it all together

With this, we can see that the lower-layer features need to have similar estimates for more complex entities (like digits) to be detected. The distorted face that we saw earlier would not be detected as a face at all! Moreover, if we pass in the image of the cat on the left and the cat on the right, the activations and every other component of the vector output would be similar, except for the x-position component, and this could be used to reconstruct both images of the cat from just their respective encodings, without resorting to any hacks. This has also been shown to generalize very well to 3D objects viewed from different angles and to 2D objects under affine transformations. Each of these techniques seems decent on its own, but put together, they have the potential to revolutionize machine learning for years to come.

Bonus Extra: Coordinate Addition

As we discussed, the network can learn whatever features it deems most suitable. We can tune it to recognize particular features through a technique called Coordinate Addition, applied on top of the reconstruction process.

To test whether you actually stored the (x, y) coordinates of the cat, rather than just checking that you can reconstruct the initial image, I can ask you to draw the cat at (x + 10, y + 5). If your output matches the same image of the cat shifted by that amount, then I know that you learnt to store the position as an (x, y) pair.

Similarly with the reconstruction: let's say we have (in the case of the Dynamic Routing paper) a 16-dimensional vector representing the correct output class, which is decoded to reproduce the original image. Now, we can add a small dx and dy to its first two coordinates and test the network's ability to reconstruct the image shifted by (dx, dy) pixels. The only way for the network to stay consistent is to use those first two elements to represent the position of the detected entity, without affecting what it is or its other properties.
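A sketch of that consistency check, purely for illustration: decoder is again the reconstruction decoder, and the assumption that one unit in the vector corresponds to one pixel of shift is mine, not the paper's.

```python
import numpy as np

def shifted_reconstruction_error(capsule_vec, decoder, image, dx, dy):
    """Nudge the first two elements of the class capsule's 16-D vector by
    (dx, dy) and measure how far the decoded image is from the original
    image shifted by (dx, dy) pixels. Training against this shifted target
    forces the first two elements to encode position."""
    perturbed = capsule_vec.copy()
    perturbed[0] += dx
    perturbed[1] += dy
    reconstruction = decoder(perturbed).reshape(image.shape)
    target = np.roll(np.roll(image, dy, axis=0), dx, axis=1)   # image shifted by (dx, dy)
    return np.mean((reconstruction - target) ** 2)             # low error => position lives in dims 0 and 1
```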

Similarly, coordinate addition can be used in all capsule layers to tune all capsules to keep track of spatial information. In Hinton’s second paper on Matrix Capsules with EM Routing, they implemented Coordinate Addition by adding the position of the center of the kernel’s receptive field to the first two elements of the output pose matrix.

This was shown to perform much better than networks without any coordinate addition (with a 1.8% test error rate compared to 2.6% without coordinate addition while using capsules with matrix outputs).

7. Conclusion

Inverse Graphics seems to be a pretty accurate model of human sight based on our current knowledge about this space. While hierarchical learning and parameter sharing have been around for a few years, the proof-of-concept of Inverse Graphics in Computer Vision opens many new avenues for development.

These have been shown to achieve state-of-the-art performance on standard datasets like MNIST (with a 99.75% test accuracy) and SmallNORB (with a 45% reduction in error from the previous state of the art). However, the applications and performance of these networks on real, more complex data have not yet been verified. But one very important benefit that capsule networks provide is a step away from black-box neural networks towards networks that represent more concrete features, which can help us analyze and understand what they are doing under the hood. (If you've seen the Black Mirror episode on neural networks, you know how crucial it is to be able to understand this black box. It freaked me out.)

This concludes the first part of this series on inverse graphics and dynamic routing. I hope this article helped demystify the concepts around Inverse Graphics and how they tie together human and computer vision. In Part II, which will be out soon, I will talk more about how this is implemented in Capsule Networks through Dynamic Routing and EM Routing.

If you enjoyed reading this article, please be sure to give me a clap (or more if you’d like) and follow me to know when I post the second part to this!

Acknowledgements and Further Readings

