Transfer Learning
Transferring knowledge gained by one model into another. source[1]
This is part 5 of a 5 article series:
- Training an Architectural Classifier: Motivations
- Training an Architectural Classifier: Softmax Regression
- Training an Architectural Classifier: Deep Neural Networks
- Training an Architectural Classifier: Convolutional Networks
- Training an Architectural Classifier: Transfer Learning
In this final article of the series, I’ll use a technique called transfer learning to improve on the accuracy of the convolutional network from the previous article.
Transfer Learning
In short, the idea behind transfer learning is to take a model that has already been trained for a different task and use it as a starting point for your task.
There are essentially two components to a model:
- The mathematical operations performed by it (its architecture).
- The learned weights gained through training.
1. One typically chooses a known public architecture that has performed well on a benchmark related to their problem domain. One such benchmark for image recognition is the ILSVRC (usually just referred to as ImageNet). ImageNet covers a very broad set of training data across 1,000 classes. This makes models that perform well on it ideal for transfer learning, because they tend to generalize very well. So if you choose an architecture that is proven on ImageNet, it’s likely to have high accuracy on your problem as well.
2. If you take a copy of that architecture with all of the weights it learned from ImageNet, you can start your own training from a position much more accurate than random initialization. That means training time will be far shorter than if you were to train it from scratch.
The only work to be done is retraining the model, or some portion of it, for your specific task and output classes.
Implementation
In nearly all deep architectures, the final layer will be a classifier with a number of neurons equal to the number of classes it can predict. To adapt a network for transfer learning, simply cut off that last layer and replace it (and its weights) with your own layer of the appropriate size for your task.
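To make the "cut off the last layer" step concrete, here is a minimal sketch in Keras (the framework used later in this article). The helper name `replace_classifier` and the tiny stand-in network are my own assumptions for illustration; a real use would pass in a model loaded with its trained weights.

```python
from tensorflow import keras


def replace_classifier(pretrained, num_classes):
    """Reuse every layer of `pretrained` except its last one, then attach a
    fresh softmax layer sized for the new task."""
    # Activations that used to feed the original classifier layer.
    penultimate = pretrained.layers[-2].output
    new_output = keras.layers.Dense(
        num_classes, activation="softmax", name="new_classifier"
    )(penultimate)
    return keras.Model(inputs=pretrained.inputs, outputs=new_output)


# Hypothetical stand-in for a pre-trained 1000-class network; in practice this
# would be a real model loaded together with its trained weights.
inputs = keras.Input(shape=(32,))
hidden = keras.layers.Dense(16, activation="relu", name="hidden")(inputs)
old_output = keras.layers.Dense(
    1000, activation="softmax", name="old_classifier"
)(hidden)
pretrained = keras.Model(inputs, old_output)

# Same layers (and weights) as before, but ending in a 2-class output.
two_class_model = replace_classifier(pretrained, num_classes=2)
print(two_class_model.output_shape)
```

The new model shares the earlier layers with the original, so any weights they have already learned carry over; only `new_classifier` starts from random initialization.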
You also have a few choices to make once you have both architecture and weights. If your own dataset is large enough, and you have the computational resources to train your full chosen architecture, you can simply use the pre-trained weights as the initial values and continue training the model as normal. If you lack either of these, though, it is worthwhile to explore freezing the weights in the early layers of the network (at least in convolutional networks, these are more generalizable) and only allowing the later layers, or even just the output layer, to undergo gradient descent. This can significantly reduce the computational cost, since you do not have to backpropagate error through the entire network, and it often yields very good accuracy.
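As a rough sketch of the freezing option, Keras lets you mark layers as non-trainable through each layer's `trainable` attribute. The helper `freeze_early_layers`, the small stand-in network, and the random data below are assumptions made only for this illustration.

```python
import numpy as np
from tensorflow import keras


def freeze_early_layers(model, num_trainable):
    """Freeze every layer except the last `num_trainable` layers."""
    cutoff = len(model.layers) - num_trainable
    for index, layer in enumerate(model.layers):
        layer.trainable = index >= cutoff
    return model


# Hypothetical small network standing in for a pre-trained model.
inputs = keras.Input(shape=(8,))
x = keras.layers.Dense(16, activation="relu", name="early")(inputs)
x = keras.layers.Dense(16, activation="relu", name="late")(x)
outputs = keras.layers.Dense(2, activation="softmax", name="classifier")(x)
model = keras.Model(inputs, outputs)

freeze_early_layers(model, num_trainable=1)
# Compile after changing `trainable` so the change takes effect in training.
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# Random placeholder data, only to show that frozen weights do not move.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 8)).astype("float32")
labels = rng.integers(0, 2, size=64)

early_before = model.get_layer("early").get_weights()[0].copy()
head_before = model.get_layer("classifier").get_weights()[0].copy()
model.fit(features, labels, epochs=1, batch_size=16, verbose=0)
early_after = model.get_layer("early").get_weights()[0]
head_after = model.get_layer("classifier").get_weights()[0]

print("early layer changed:", not np.array_equal(early_before, early_after))
print("classifier changed:", not np.array_equal(head_before, head_after))
```

How many layers to leave trainable is a judgment call: the fewer you train, the cheaper each step, at the cost of less adaptation to your own data.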
The absolute fastest way to do this is to forward propagate each image in your dataset through to the last layer that you want to keep, then save out the activation values produced by that layer (sometimes referred to as “bottleneck” values). These values become your new pre-processed “image” that you can train a very simple classifier on. That way the computational cost of the forward prop step is only charged once for each image, rather than once with each epoch of training.
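The caching idea can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the tiny `extractor` stands in for the kept layers of a pre-trained network, the images and labels are random placeholders, and the file name `bottleneck_features.npy` is hypothetical.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Hypothetical frozen feature extractor standing in for the kept layers of a
# pre-trained network.
inputs = keras.Input(shape=(16, 16, 3))
x = keras.layers.Conv2D(4, 3, activation="relu")(inputs)
x = keras.layers.GlobalAveragePooling2D()(x)
extractor = keras.Model(inputs, x)
extractor.trainable = False

# Random placeholder images and labels.
rng = np.random.default_rng(0)
images = rng.random((40, 16, 16, 3)).astype("float32")
labels = rng.integers(0, 2, size=40)

# One forward pass per image, then save the activations to disk.
cache_path = os.path.join(tempfile.mkdtemp(), "bottleneck_features.npy")
np.save(cache_path, extractor.predict(images, verbose=0))

# Every epoch below reads the cached features; the extractor is never run
# again, so its forward-pass cost is paid exactly once per image.
bottleneck = np.load(cache_path)
classifier = keras.Sequential([
    keras.Input(shape=bottleneck.shape[1:]),
    keras.layers.Dense(2, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
classifier.fit(bottleneck, labels, epochs=5, verbose=0)
print(bottleneck.shape)
```

One trade-off worth noting: because the features are computed once, on-the-fly image augmentation no longer changes what the classifier sees unless you also cache features for the augmented images.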
You can also simply slice the pre-trained network where you like, and then wire new output layers on top of it.
Keras
One deep-learning framework in particular, Keras, makes this very simple, and it’s what I’ll be using in the notebook below. You’ll also notice that Keras makes life a lot easier in many other ways, from data loading to image augmentation to architecture definition and training; it is a really fantastic library.
To accomplish transfer learning in Keras, we’ll:
- Import a model from its library (I’ll be using VGG16).
- Hack off the fully connected classification layers at the top of the network (you can actually specify this as an import option).
- Run all image examples forward through the network, storing the activation values from the last convolutional block as a new set of “image” features.
- Build a simple two-layer fully connected model with a two-neuron softmax output.
- Train away, using the pre-calculated VGG features as input to the classifier.
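The steps above can be sketched roughly as follows. This is not the notebook's code: the function names, the 224x224 input size, the 256-unit hidden layer, the Adam optimizer, and the epoch count are all assumptions chosen for illustration. `include_top=False` is the import option that drops VGG16's fully connected layers, and `weights="imagenet"` downloads the ImageNet-trained weights on first use.

```python
from tensorflow import keras


def build_vgg16_extractor(input_shape=(224, 224, 3), weights="imagenet"):
    """VGG16 convolutional base without its fully connected top."""
    base = keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # used only for feature extraction
    return base


def build_classifier(feature_shape, hidden_units=256, num_classes=2):
    """Small fully connected model trained on the cached VGG16 features.

    `hidden_units=256` is an assumed size, not a value from the notebook.
    """
    return keras.Sequential([
        keras.Input(shape=feature_shape),
        keras.layers.Flatten(),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])


def extract_and_train(images, labels, epochs=10):
    """Train the small classifier on pre-calculated VGG16 features.

    `images`: float array of shape (n, height, width, 3), RGB, values 0-255.
    `labels`: integer class ids (0 or 1).
    """
    extractor = build_vgg16_extractor(input_shape=images.shape[1:])
    # VGG16 expects its own ImageNet preprocessing; copy so the caller's
    # array is left untouched.
    prepared = keras.applications.vgg16.preprocess_input(images.copy())
    features = extractor.predict(prepared, verbose=0)

    classifier = build_classifier(features.shape[1:])
    classifier.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    classifier.fit(features, labels, epochs=epochs, verbose=0)
    return classifier
```

Calling `extract_and_train(images, labels)` with your own arrays runs VGG16 once over the dataset and then trains only the two new layers on the resulting features.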
Only the last 2 layers that we created will be trained, but the features they are fed will have undergone the full extraction power of the ImageNet trained VGG16 network. See the notebook for code:
And we end up with 91.9% accuracy with very little tweaking or training time. This is a huge improvement over our initial 57% accuracy with softmax regression, and with more training, data, and experimentation it should continue to improve.
