This post is inspired by material studied while interning with @jeremyphoward and @math_rachel‘s fast.ai, in particular Lesson 14 of their course Cutting Edge Deep Learning for Coders, taught at USF’s Data Institute. If you’d like to see my end-to-end code for this project, please check out my repository There’s Waldo.
By now, everyone outside of the field likely knows that recent reports of the “Facebook AI Incident” have been greatly exaggerated (Fake News!). That’s an understatement; the reported story is a gross distortion of an otherwise exciting research paper at the hands of horrendous journalists.
No, Skynet has not gained awareness. However, AI continues to progress at a rapid pace. In particular, the field of computer vision has advanced considerably since the resurgence of deep learning; convolutional neural networks have made tasks like image classification and object detection near trivial. And though cyborgs are still the stuff of science fiction, their operative components are not (see self-driving cars). So in a way, we do have the ability today to help a Terminator acquire it’s target.
Today, that target is Waldo. Yes, everyone’s favorite bespectacled wanderer and master of occlusion has gotten himself into hot water; we’re going to look at exactly how to find him using semantic segmentation with the Fully Convolutional DenseNet known as Tiramisu.
For those unfamiliar, Where’s Waldo (or Wally) is a series of children’s books that challenges the reader with finding the eponymous character and his compadres in densely illustrated images.
Here’s an example:
This is one of the more absurdly difficult challenges, but representative of how time consuming and daunting they can be.
A quick Google search turns up a few machine/deep learning solutions to this very same problem. Most notably, Penn Senior Data Scientist Randy Olson’s Optimal Where’s Waldo Strategy determines the optimal search path to finding Waldo given his location in all 68 Waldo images.
Olson’s approach doesn’t actually find Waldo, it tells you the best way to find him based off knowing where he is in all 64 images. This is a great solution to a different task, one that utilizes knowledge of Waldo’s locations a priori. Other approaches just find Waldo in the image, given what he looks like in that image.
Our goal is to find Waldo as humans do. Given a new image and a conceptual understanding of what Waldo is, the model should locate Waldo even though it has never seen him in that picture before.
I approached this task as a semantic segmentation problem. The goal of semantic segmentation is to detect objects in an image; it does this by making per-pixel classifications.
This street image from the Camvid dataset is a standard example of how this works. Every pixel in the image has been labeled as belonging to some class of object, be it a tree, building, car or human.
The task then is to build a model that predicts the class of each pixel. For the purposes of detecting Waldo, there are only two classes for our images: Waldo, and not-Waldo.
This bounding box to the left is indicative of the other 17: they are all squarely around Waldo’s head, and naturally include some of the surrounding background. Traditionally for semantic segmentation, we would only want pixels depicting Waldo to be labelled. Unfortunately I’m not familiar with any easy way to label pixel by pixel, nor did I think it necessary to make that effort.
Once I set these boxes, I built binary label images representing not-Waldo and Waldo. In general these look something like this:
Great! Now I have inputs and targets, essential ingredients to training our model. But before we get to that, we need to address a few problems:
I addressed these problems by dynamically sampling 224 x 224 sub-images from the original 18 online:
The second point is crucial. Tiramisu is fully convolutional: it exploits local information only. This means that we can essentially train a network for the complete, high resolution images (our main goal) by training on carefully managed sample images. Of course the sample images will have edges that don’t exist in the whole image, but the effect is negligible.
About carefully managing sampling: the large amount of possible sample images and small batch size (~6 on Titan X) means I had to make sure there was an appropriate number of images containing Waldo in each batch.
In order to do this I isolated 18 “Waldo Images” for each original. These images are constructed so that every random 224 x 224 sample contains a complete Waldo.
I also made sure that when sampling from the full images, I omitted any that contained Waldo, partially or otherwise. This was helpful to make sure full image sampling produced negatives, and to avoid forcing the network to learn from unhelpful/incomplete positives (e.g. it wouldn’t be useful to learn from an image containing only the tip of Waldo’s hat).
Now that we understand the data generating process I used, let’s talk about the model.
If you’re unfamiliar with the fully convolutional DenseNet known as Tiramisu, I highly recommend reading Jégou et. al’s original paper The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation.
For practical usage and guidance, I also highly recommend taking a look at Jeremy Howard’s implementation as part of the fast.ai course Cutting Edge Deep Learning For Coders Part 2 (Full Disclosure: I annotated this implementation and others with full descriptions as part of my intern work for fast.ai).
If you’re fairly comfortable in this domain, this diagram from the original paper should suffice in explaining the essence of this architecture. Essentially, Tiramisu is a marriage of the U-Net architecture commonly used in semantic segmentation and the advantages of the forward layer-connections found in DenseNet.
I trained this network using RMSProp and categorical cross-entropy loss. Obviously, pixels containing Waldo are sparse. I was able to mitigate the effects of this class imbalance two ways:
After training for about an hour and a half, I achieved some promising results.
It seemed the model made up it’s mind about most classifications fairly quickly. Of course there is a lot of noise, which is consistent with the online data generation.
I suspect convergence would be much smoother had I figured out an easy way to sample from the 142 million possible images without replacement. I didn’t, so the model is constantly seeing entirely new images, especially the negatives sampled from the full image.
Qualitatively, we get very encouraging results!
At this early stage of training, it became apparent that the model had begun to learn where Waldo is while effectively screening out the background.
This is great news! Several previous models were unable to handle the class imbalance and ended up with zero-sensitivity. Seeing these results at this early stage convinced me to keep training this model overnight.
After overnight training, I noticed that while the model had become extremely sensitive it was still having some difficulty with negatives. Given the large sampling space of negatives, I assessed this was because the model wasn’t seeing enough of the full images.
After balancing out the positive-negative batch ratio and training for another 2000 iterations, I finally achieved some stellar results. I was able to visualize my model’s performance by creating a transparency mask from the predictions and overlaying them over the original image.
Yes that tiny speck in the left corner is Waldo (ironically, he’s exactly where Olson suggests you start your search). I want to stress that this is not a rounded binary transparency map, these are raw prediction values. We can see how how confident the model is in it’s predictions. A close up:
I’m very satisfied with this performance, and it is consistent across the entire training set. See for yourself!
A quick glance at the training predictions will tell you a few things.
Wenda is Waldo’s female counterpart:
At most resolutions in the training set, she looks very similar, if not exactly like Waldo. This is a completely reasonable mistake for the network to make. Since Wenda is a negative sample, it’s unlikely her image is sampled that often; when the model does see her, it thinks it’s Waldo. If we were to increase her presence in training, I’m sure the model would learn to ignore her.
Of course, the real task is to see if the network has learned to generalize the concept of Waldo, and is able to find him in a new picture it’s never seen before:
Wow! That’s great!!! I’ve forever solved Waldo.
OK, not really. 2 of the 8 images I tested on resulted in no positives.
And yet, the remaining 6 had accurately and confidently located Waldo, Wenda, or both. Further, all images lacked noise in the negative space; there are false positives, but no unsure “clouds”.
So while this model isn’t perfect, it’s more than enough evidence that this task is solvable using this approach, and more importantly it’s generalizable.
I have no doubt that with:
and lots of free time, somebody can train a bullet-proof Where’s Waldo model.
A few final notes.
There may be some confusion as to how exactly I’m producing these predictions for the entire image, while training on 224 x 224 samples. A fully convolutional network like Tiramisu can work on the full size image in theory of course. Unfortunately in practice, the original image is far too large to load into memory, and downsampling it destroys the fine details needed to classify.
My solution was to resize each image into the next largest dimensions that are divisible by 224, then split it into individual panels, each 224 x 224. I then made predictions on each of these panels, and recombined them together as the final output.
Overall the detriment of doing this as opposed to theoretical prediction on the whole image is negligible because these sub-panels are large enough to contain the information necessary for prediction.
The only time this becomes problematic is when the panelling actually splits Waldo. I’m sure there are ways to avoid this but I found this approach to be good enough.
Speaking of approaches, you might wonder why I didn’t use bounding box regression for this task. Again, these images are just too large, and downsampling destroys information. Not only are they too large but the boxes are likely relatively too small.
What might actually work is splitting into panels, classify the panel globally as containing Waldo or not, and then regress bounding boxes on those that do. But that’s a more complicated approach then this one.
That’s it! I hope you enjoyed this, and if you’re looking for a more in-depth overview of my end-to-end process including code, please check out my repository for this project, There’s Waldo.