This week, my adventures with deep learning introduced me to the concept of autoencoders. And with the help of my Pokemon, you can learn about them too!
An autoencoder is a special type of neural network that takes in something and learns to represent it with reduced dimensions.
Think of it like learning to draw a circle to represent a sphere. You know it’s actually a sphere, but you decide that it’ll be a good idea to represent it as a circle on your physics notebook, instead of gluing a sphere on the page. When you come back to this page again (hopefully) one day, you’ll look at the circle and treat it as a sphere. You can do this because you’ve subconsciously learnt to autoencode a 3 dimensional sphere as a 2 dimensional circle.
Congratulations, you’re an autoencoder! Actually, you know you’re not limited to circles. You can do cubes. Da Vinci can do Mona Lisa. Essentially, we’re autoencoding our thoughts as 2 dimensional words when we write. You get the idea.
An autoencoder consists of two parts: an encoder, which learns to convert the input (X) into a lower dimensional representation (Z), and a decoder, which learns to convert the lower dimensional representation back to its original dimensions (Y).
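To make the shapes concrete, here’s a minimal NumPy sketch of the two parts. The sizes (8 input values squeezed down to 2), the random untrained weights and the names encode and decode are illustrative assumptions of mine, not the model used later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, encoded_dim = 8, 2  # illustrative sizes only

# Untrained, random weights; training would tune these.
W_enc = rng.normal(size=(input_dim, encoded_dim))
W_dec = rng.normal(size=(encoded_dim, input_dim))

def encode(x):
    """Map the input X to the lower dimensional representation Z."""
    return x @ W_enc

def decode(z):
    """Map Z back to the original dimensions, giving the output Y."""
    return z @ W_dec

X = rng.normal(size=(5, input_dim))  # 5 made-up samples
Z = encode(X)
Y = decode(Z)
print(X.shape, Z.shape, Y.shape)
```

Y has the same shape as X, but with random weights it won’t look anything like X yet. Getting Y close to X is what training is for.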
If your autoencoder is really good, your input and output will be exactly the same. This would be a perfect autoencoder.
In reality though, this is rarely achieved.
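One way to put a number on how far an autoencoder is from perfect is to compare the input with the output value by value. Here’s a small sketch using mean squared error. Treat that choice of measure and the made-up numbers as my assumptions.

```python
def mean_squared_error(x, y):
    """Average squared difference between input x and reconstruction y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

original = [0.0, 0.5, 1.0, 0.25]
perfect = [0.0, 0.5, 1.0, 0.25]   # what a perfect autoencoder would return
lossy = [0.1, 0.4, 0.9, 0.25]     # a more realistic, slightly-off output

perfect_error = mean_squared_error(original, perfect)
lossy_error = mean_squared_error(original, lossy)
print(perfect_error, lossy_error)
```

A perfect reconstruction scores exactly zero, and anything else scores above it.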
So why would we want an autoencoder? An obvious reason is compression, right? You’re reducing dimensions, so you can just pass around the lower dimensional data. But think about it, and you’ll realize that even though we have achieved compression, we still have to pass around the decoder for anyone to be able to understand what we’ve encoded in the first place.
Think of it like passing around notes in Mandarin in London.
Say you decide to encode the word ‘tree’. Sure, you reduced what would be 4 characters in English to a single character in Mandarin. But in order to understand what you mean, the other person would also need to know Mandarin. Now you begin to see that in order to faithfully pass around 木 (mù) or its cousins with any hope of Londoners understanding it, you also need to send along a copy of a Mandarin handbook for English speakers.
Clearly, this is not as efficient as you thought it would be. Okay, so if not for compression, what else do we use it for?
Well, we mentioned that autoencoders are not perfect, and their decoder often does not produce the exact output that was encoded. What if we use it instead to produce output that is similar to the input?
This means we can use it to generate even more things that are similar to the input. For example, we could produce more images of flowers from a bunch of images of flowers.
A variational autoencoder (VAE) is a special type of autoencoder that’s specifically designed to tackle this. The VAE can introduce variations to our encodings and generate a variety of outputs that resemble our input.
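To give a feel for what “introducing variations to our encodings” can look like, here’s a tiny sketch of the sampling step a VAE typically uses: the encoder outputs a mean and a log-variance for each encoded value, and the encoding is drawn at random around that mean. I don’t build a VAE in this post, so the function name and all the numbers below are hypothetical.

```python
import math
import random

def sample_encoding(mean, log_variance, rng):
    """Draw one encoding z = mean + std * noise, value by value."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_variance)
    ]

rng = random.Random(0)
mean = [0.5, -1.0, 0.0, 2.0]             # hypothetical encoder output
log_variance = [-2.0, -2.0, -2.0, -2.0]  # hypothetical spread per value

# Three different encodings for the same input.
samples = [sample_encoding(mean, log_variance, rng) for _ in range(3)]
for z in samples:
    print([round(v, 3) for v in z])
```

Each sample is a slightly different encoding of the same input, so decoding them would give slightly different outputs.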
Okay, time to embark on our Pokemon journey!
The Pokemon we’ll be working with are the Nintendo DS Style bitmap images. You can grab a copy here.
The resolution of each image is 64 pixels x 64 pixels. I thought this would be a good leap from the friendly MNIST dataset that everybody likes to play with — and, this time it would be in color, for a change.
Each pixel is described by three RGB values, so each image is described by 64 x 64 x 3 values. This gives us a total of 12288 values.
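If you’d like to check that arithmetic yourself, here’s a quick sketch in plain Python. The all-black image is just a stand-in I made up so there’s something to count; it isn’t one of the Pokemon images.

```python
height, width, channels = 64, 64, 3

# A stand-in all-black image as nested lists: rows -> pixels -> [R, G, B].
image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]

# Unroll it into one long list of values.
flat = [value for row in image for pixel in row for value in pixel]

print(len(flat))  # 12288
```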
Also, notice that each Pokemon is unique and fundamentally distinct. There’s a lot of variation. Just look at the first generation Pokemon.
Wow. That’s a lot of variation, right?
But our dataset contains FIVE such generations. And not just their front views, but the rear views too. And two different shots of each view. Each Pokemon has 4 shots. That’s variety!
Okay, enough talk about data. Let’s autoencode.
Our encoder first takes each 64 x 64 x 3 image and flattens it into a one dimensional vector of 12288 values. It then proceeds to downsample the vector step by step until we get the desired encoded dimension. Essentially this is a series of matrix multiplication operations. The decoder does the opposite process and tries to reconstruct the original image from the encoded value through a similar series of matrix multiplication operations.
How does the encoder know what matrix to multiply the input with? Well, that’s exactly why we train the autoencoder. Once we specify the dimensions of the encoder and decoder matrices, the autoencoder tries to figure the best values for the matrices so that it can best do its job!
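Here’s a toy version of that training loop in NumPy, so you can see the matrices being nudged towards better values. It’s a sketch under several assumptions of mine: made-up data with 6 values per sample instead of 12288, one matrix each for the encoder and decoder, no nonlinearities, mean squared error as the measure of “doing its job”, and plain gradient descent with a hand-picked learning rate. The real project used a deeper network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples of 6 values that really depend on 2 hidden factors.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing

# Encoder and decoder matrices, started at small random values.
W_enc = rng.normal(scale=0.1, size=(6, 2))
W_dec = rng.normal(scale=0.1, size=(2, 6))

def loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial_loss = loss(X, W_enc, W_dec)
learning_rate = 0.02  # hand-picked for this toy problem

for step in range(3000):
    Z = X @ W_enc   # encode
    Y = Z @ W_dec   # decode
    error = Y - X
    # Gradients of the mean squared error with respect to each matrix.
    grad_Y = 2 * error / error.size
    grad_W_dec = Z.T @ grad_Y
    grad_W_enc = X.T @ (grad_Y @ W_dec.T)
    W_enc -= learning_rate * grad_W_enc
    W_dec -= learning_rate * grad_W_dec

final_loss = loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

The final loss should come out lower than the initial one, which is all “training” means here: repeatedly adjusting the matrix values in whichever direction makes the reconstruction a little better.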
Shower Thoughts: What if a Pokeball is really a perfect autoencoder, which encodes a Pokemon (reducing the space and mass it takes up) into something that can fit inside the ball until summoned by the trainer and decoded back? :O
I used an encoder that downsampled the images from 12288 to 1024 to 64 and finally to 4 values. So each of our encoded Pokemon will be represented by just FOUR values. All 150 of them and more! Isn’t that awesome?
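To get a sense of scale, here’s a rough count of how many matrix entries those layers imply. I’m assuming the decoder simply mirrors the encoder (4 to 64 to 1024 to 12288) and I’m ignoring any bias terms, so take the totals as a ballpark rather than an exact description of the model.

```python
encoder_sizes = [12288, 1024, 64, 4]
decoder_sizes = encoder_sizes[::-1]  # assumed mirror image of the encoder

def count_weights(sizes):
    """Number of entries in the matrices linking consecutive layer sizes."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

encoder_weights = count_weights(encoder_sizes)
decoder_weights = count_weights(decoder_sizes)
compression_ratio = encoder_sizes[0] / encoder_sizes[-1]

print(encoder_weights, decoder_weights)  # 12648704 12648704
print(compression_ratio)                 # 3072.0
```

So each image shrinks by a factor of 3072, but the decoder needed to read it back has over 12 million values of its own. That’s the Mandarin handbook problem from earlier, in numbers.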
Okay, so let’s take a look at our input:
Awesome. So you can see a sample of 100 Pokemon here. Like we expected, there’s drastic variation in the backgrounds, camera angles and the features of the Pokemon. I trained my autoencoder for about 250 epochs and here’s the result from the decoder:
Holy smokes, that’s pretty good, huh? It’s pretty darn amazing that a lot of Pokemon look decent enough to be identified. Remember, we downsampled everything to just FOUR values.
To visualize how our autoencoder trained, I built a gif from output snapshots taken every tenth epoch.
You can see that the autoencoder learnt to match the input more and more closely till about the 150th epoch (the 15th frame of the gif, if you have a good eye ;) and then plateaued.
But this is awesome! Maybe I should try to generate new Pokemon using a VAE sometime.
For now, I gotta go catch some more!
Note: If you liked this article, and would like to see more like this, all you have to do is to press the little heart button at the bottom, and I’ll know :)
The codebase used for this project was built upon Parag Mital’s excellent course on Creative Applications of Deep Learning. You can find the codebase here. Siraj Raval released a really awesome video on autoencoders earlier this week, which is what inspired me to work on this article.
Isn’t it just terrific that we can learn from each other from halfway across the planet?