This is part 4 of a 5 article series:
In the previous post, I explored deep neural networks to solve our kitchen classification problem. While they improved on the simple logistic regression model, they exhibited very strong overfitting even with dropout regularization. I hypothesized that this was, at least in part, due to the curse of dimensionality: we have a small amount of very high-dimensional data, and that reduces the probability of the network finding weights that generalize well.
One solution to this curse is to reduce the dimensionality of the data, but we want to do this in an intelligent way. If we simply reduce the image size, we’ll likely be throwing away important information. Instead, we’ll turn to a deep learning technique called convolution to first pick out the most important details, then get rid of the rest before passing the result on to the classification layers.
Convolutions are actually fairly simple to understand conceptually. A convolution takes a very small image called the filter (typically 5px by 5px or smaller) and slides that filter across the input image. Spatial relationships are maintained because we do not flatten the input like we did before. To visualize this, consider the animation below:
The yellow filter slides 1 pixel at a time across the green input image, combining their values (in red) into the pink output “convolved feature”. For a given convolution layer, we may do this with many different filters, producing multiple output images. So what does this look like on an actual image? Here’s an animation of it happening with a filter producing edges:
You can see that, depending on the filter used, different features are “found” in the input image. Early-stage, small filters applied over a large image will find very small-scale features, like edges. If we stack more convolutions on top of these, progressively more complex and larger-scale features can be found: from edges, to corners, to eyes, to faces, to people.
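To make the sliding-filter idea concrete, here is a minimal sketch in plain Python. The `convolve2d` helper, the toy 5x5 image, and the 3x3 vertical-edge filter are all made up for illustration and are not taken from the notebook. Like most deep learning libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` across `image` one pixel at a time, with no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Multiply the filter with the patch beneath it and sum the products.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Toy image: two dark columns (0) on the left, three bright columns (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical vertical-edge filter: responds where brightness rises left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The output is large where the filter straddles the dark-to-bright boundary and zero over the uniform bright region, which is what "finding an edge" means here. In a real network the filter values are learned rather than written by hand.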
The second important step in convolution is to reduce the dimensions of these feature maps, getting rid of the less important information. This is typically done with a function called spatial pooling. The idea of spatial pooling is to define some spatial window, like 2px by 2px, and to combine the values within that window into a single value. A 2px by 2px pool moved 2 pixels at a time would cut both the width and the height of the feature map in half, leaving a quarter of the original values.
Shown above is max pooling, where we take the maximum numerical value from the window and throw away the rest. This is the most common form of pooling and what I’ll be using in the notebook.
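As a rough illustration of max pooling, the sketch below applies a 2x2 window with a stride of 2 to a small feature map. The `max_pool` helper and the numbers in the 4x4 map are invented for this example, not taken from the notebook.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value inside each size x size window."""
    pooled = []
    for row in range(0, len(feature_map) - size + 1, stride):
        pooled_row = []
        for col in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[row + i][col + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Made-up 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4x4 map becomes 2x2: each side is halved, so 16 values shrink to 4 while the strongest response in each region survives.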
Some of the code structure will look a little different in this notebook for two reasons:
TensorFlow Slim is simply a more condensed syntax for defining layers and variables. For instance, a fully connected layer in TensorFlow:
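# xavier and zeros are assumed to be initializer aliases defined elsewhere,
# e.g. tf.contrib.layers.xavier_initializer and tf.zeros_initializer.
W_one = tf.get_variable('weights_1', [85 * 85 * 3, 1000], initializer=xavier())
b_one = tf.get_variable('bias_1', [1000], initializer=zeros())  # one bias per output unit
logits_one = tf.add(tf.matmul(inputs, W_one), b_one)
layer_one = tf.nn.relu(logits_one)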
becomes much smaller in TensorFlow Slim:
layer_one = slim.fully_connected(inputs, 1000)
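That one-liner still creates the same variables behind the scenes. As a quick back-of-the-envelope check of how large this layer is, the arithmetic below counts its parameters, assuming the flattened 85 x 85 x 3 input and 1000 units from the snippet above, plus one bias per unit.

```python
# Flattened input size: 85 x 85 pixels with 3 color channels.
input_size = 85 * 85 * 3

# Number of units in the fully connected layer.
hidden_units = 1000

# One weight per (input, unit) pair, plus an assumed single bias per unit.
weights = input_size * hidden_units
biases = hidden_units
total_parameters = weights + biases

print(input_size)        # 21675
print(total_parameters)  # 21676000
```

Over 21 million parameters in the first layer alone, trained on a small dataset, is a large part of why reducing dimensionality before the classification layers is attractive.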
With these concepts in mind, here’s the experiment notebook for applying a convolutional network to our kitchen classification problem:
More success! The validation accuracy is now up in the high 70s, reaching nearly 77%. With more experimentation on the architecture, the low 80s should be achievable. As usual, here’s the TensorBoard summary:
In the final notebook, I’m going to try something a little different, a technique known as transfer learning. Check it out!
As a brief aside, I noticed during the last training that the GPU was not being fully utilized. Here is the output of the nvidia-smi tool during training:
You can see we’re only looking at about 9% utilization despite long training times, and this bounced around a lot. This indicates the GPU is being starved for data. Training examples can’t be moved out of CPU memory and onto the GPU fast enough to keep up with how quickly the GPU burns through calculations, so utilization stays low while the GPU waits for data to be fed in. The reasons and solutions for this really deserve their own article, but in brief: TensorFlow’s feed_dict input system is really meant for testing and toy examples on small datasets. The appropriate way to feed data in production is to use queue runners, which use multiple threads to read data off disk or out of memory and keep the GPU well fed. The downside is that they add some code complexity. I won’t describe their implementation in detail, but hopefully you can follow along with what is happening in the notebook.
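To give a feel for the idea without the TensorFlow machinery, here is a minimal sketch of the same producer/consumer pattern in plain Python. This is not the queue runner API and not the code from the notebook. The `producer` and `train` functions are hypothetical, the "batches" are just lists of integers, and a single background thread stands in for the several threads a real input pipeline would use.

```python
import queue
import threading


def producer(batch_queue, num_batches, batch_size):
    """Background thread: prepare batches and push them onto the queue."""
    for i in range(num_batches):
        # Stand-in for reading and decoding images from disk.
        batch = [i * batch_size + j for j in range(batch_size)]
        batch_queue.put(batch)  # blocks while the queue is full
    batch_queue.put(None)  # sentinel: no more data


def train(num_batches=5, batch_size=4, capacity=2):
    """Main thread: consume batches as soon as they are ready."""
    batch_queue = queue.Queue(maxsize=capacity)
    worker = threading.Thread(
        target=producer,
        args=(batch_queue, num_batches, batch_size),
        daemon=True,
    )
    worker.start()

    results = []
    while True:
        batch = batch_queue.get()  # waits only if nothing is prefetched yet
        if batch is None:
            break
        results.append(sum(batch))  # stand-in for one training step
    worker.join()
    return results


results = train()
print(results)  # [6, 22, 38, 54, 70]
```

The bounded queue is the key design choice: the producer works ahead by up to `capacity` batches, so the consumer rarely has to wait, but memory use stays capped if the consumer falls behind.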