This is part 5 of a 5 article series:
In this final article of the series, I’ll try using a technique called transfer learning to improve on the accuracy we achieved with the convolutional network from the previous article.
In short, the idea behind transfer learning is to take a model that has already been trained for a different task and use it as a starting point for your task.
There are essentially two components to a model: the architecture and the trained weights.
1. One typically chooses a known public architecture that has performed well on a benchmark related to your problem domain. One such benchmark for image recognition is the ILSVRC (usually just referred to as ImageNet). ImageNet covers a very broad set of training data spanning a thousand classes. This makes models that perform well on it ideal for transfer learning, because they tend to generalize very well. So if you choose an architecture that is proven on ImageNet, it’s likely to achieve high accuracy on your problem as well.
2. If you take a copy of that architecture along with the weights it learned from ImageNet, you can start your own training from a position much more accurate than random initialization. That means training time will be far shorter than if you were to train it from scratch.
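As a concrete sketch of grabbing both pieces at once: Keras ships many proven ImageNet architectures in `keras.applications`, and passing `weights="imagenet"` fetches the trained weights along with the architecture (this assumes TensorFlow's bundled Keras; the weights download on first use).

```python
# Sketch: obtain a proven ImageNet architecture plus its trained weights.
# include_top=False drops the ImageNet-specific 1000-way classifier;
# input_shape=(224, 224, 3) is the size VGG16 was originally trained on.
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# The convolutional base ends in a 7x7x512 feature map per image.
print(base.output_shape)
```

The same call with `weights=None` would give you the identical architecture with random initialization, which is exactly the head start you'd be giving up.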
The only work to be done is retraining the model, or some portion of it, for your specific task and output classes.
In nearly all deep architectures, the final layer will be a classifier with a number of neurons equal to the number of classes it can predict. To adapt a network for transfer learning, simply cut off that last layer and replace it (and its weights) with your own layer of the appropriate size for your task.
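In Keras, `include_top=False` is effectively the "cut off the last layers" step; you then wire your own appropriately sized classifier onto the base's output. A minimal sketch (the class count of 10 is hypothetical, stand in your own):

```python
# Sketch: replace the ImageNet classifier with an output layer sized
# for our own task. NUM_CLASSES is a placeholder for your class count.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 10  # hypothetical; use the number of classes in your task

# include_top=False removes VGG16's original 1000-way output layer.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Attach a new classifier of the appropriate size for our task.
x = layers.Flatten()(base.output)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs=base.input, outputs=outputs)
```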
You also have a few choices to make once you have both the architecture and the weights. If your own dataset is large enough, and you have the computational resources to run training on your full chosen architecture, you can simply use the pre-trained weights as the initial values and continue training the model as normal. If you lack either of these, though, it is worthwhile to explore freezing the weights in the early layers of the network (in convolutional networks at least, these are the most generalizable) and allowing only the later layers, or even just the output layer, to undergo gradient descent. This can significantly reduce the computational cost, since you are not backpropagating error through the entire network, and it often yields very good accuracy.
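The freezing itself is just a flag on each layer. A minimal sketch on a toy model, to keep it self-contained (the same `trainable = False` mechanics apply to a large pre-trained network):

```python
# Sketch: freeze all but the final layer so gradient descent only
# updates the classifier. Toy architecture for illustration only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),
])

# Freeze every layer except the final classifier.
for layer in model.layers[:-1]:
    layer.trainable = False

# Only the last Dense layer's kernel and bias remain trainable now.
model.compile(optimizer="adam", loss="categorical_crossentropy")
```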
The absolute fastest way to do this is to forward propagate each image in your dataset through to the last layer you want to keep, then save out the activation values produced by that layer (sometimes referred to as “bottleneck” values). These values become your new pre-processed “image” on which you can train a very simple classifier. That way the computational cost of the forward prop step is only charged once per image, rather than once with each epoch of training.
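This bottleneck approach can be sketched as follows; random arrays stand in for a real image dataset, and the 10-class head is hypothetical:

```python
# Sketch: compute "bottleneck" activations once, then train a small
# classifier on them instead of re-running the expensive conv layers.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Placeholder batch of 4 "images"; in practice, your real dataset.
images = np.random.rand(4, 224, 224, 3).astype("float32")

# One forward pass through the frozen base; save these values out.
bottleneck = base.predict(images)

# A very simple classifier trained directly on the bottleneck values,
# so each training epoch skips the convolutional forward pass entirely.
clf = models.Sequential([
    layers.Input(shape=bottleneck.shape[1:]),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),  # 10 classes is hypothetical
])
```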
You can also simply slice the pre-trained network where you like, and then wire new output layers on top of it.
One deep-learning framework in particular, Keras, makes this very simple and it’s what I’ll be using in the notebook below. You’ll also notice Keras makes life a lot easier in innumerable other ways, from data-loading, to image augmentation, to architecture definition and training; it is a really fantastic library.
To accomplish transfer learning in Keras, we load the ImageNet-trained VGG16 network without its classifier and stack our own layers on top. Only the last 2 layers that we created will be trained, but the features they are fed will have undergone the full extraction power of the ImageNet-trained VGG16 network. See the notebook for code:
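The notebook has the full version; a minimal sketch of the setup described (frozen VGG16 base with two newly created layers on top, where the 256-unit hidden layer and the class count of 10 are illustrative choices, not the notebook's exact values):

```python
# Sketch: frozen ImageNet-trained VGG16 base; only the two new layers
# on top receive gradient updates during training.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 10  # hypothetical; use your dataset's class count

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep all the ImageNet-learned conv weights fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),             # new layer 1
    layers.Dense(NUM_CLASSES, activation="softmax"),  # new layer 2
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

From here, `model.fit` trains only the two new layers while the base acts as a fixed feature extractor.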
And we end up with 91.9% accuracy with very little tweaking or training time. This is a huge improvement over our initial 57% accuracy with softmax regression, and with more training, data, and experimentation it will only continue to improve.