Deep neural networks have many, many learnable parameters that are used to make inferences. This often poses a problem in two ways: sometimes the model does not make very accurate predictions, and it also takes a long time to train. This post talks about increasing accuracy while also reducing training time using two very recent techniques.

EDIT: This article won the award for second most popular blog post of August 2017 from KDnuggets.

The original papers can be found here (Snapshot Ensembles) and here (FreezeOut).

This article assumes some familiarity with neural networks, including concepts like SGD, minima, and optimisation.

How this article is structured

I will be talking about two different papers that aim to do different things. Note that even though these are two different ideas, they are not mutually exclusive and can be used simultaneously. This is a long post, but it is divided into two sections that can be read independently.

1. Snapshot Ensembling — M models for the cost of 1

Regular Ensemble Models

Ensemble models are a group of models that work collectively to produce a prediction. The idea is simple: train several models using different hyperparameters, and average the predictions from all of them. This technique gives a great boost in accuracy because it does not rely on a single model. Most winning entries in high-profile Machine Learning competitions have used ensembles.

So what's the problem?

Training N different models requires roughly N times the time needed to train a single model. Most people who don't have the luxury of multiple GPUs will have to wait a long time before they can test out these models, which makes experimenting much slower.

SGD Mechanics

Before I tell you about the 'novel' approach, you must first understand the nature of Stochastic Gradient Descent (SGD). SGD is greedy: it will look for the steepest descent. However, there is one very crucial parameter that governs SGD — the learning rate.

If the learning rate is too high, SGD will ignore very narrow crevices (minima) and take large steps (think of a tank not being affected by a pothole on the road). On the other hand, if the learning rate is small, SGD will fall inside one of these local minima and not be able to come out of it. It is, however, possible to bring SGD back out of a local minimum by increasing the learning rate.

The trick?

The authors of the paper exploit this controllable property of SGD falling into and climbing out of local minima. Different local minima may have very similar error rates, but the mistakes they make will be different from each other. They have included a very useful diagram that explains this concept:

Figure 1.0 — Left: standard SGD trying to find the best local minimum. Right: SGD is made to fall into a local minimum, then brought back up, and the process is repeated. This way you get 3 local minima (labelled 1, 2, 3), each with similar error rates but with different error characteristics.

What is being ensembled, a.k.a. a snapshot?

The authors use this property of local minima having different 'viewpoints' on their predictions to create multiple models. Every time SGD reaches a local minimum, a snapshot of that model is saved, and these snapshots form the final ensemble of networks.
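To make the ensembling step concrete, here is a minimal sketch of what inference with the saved snapshots could look like in PyTorch. The `snapshot_ensemble_predict` function, the `build_model()` helper and the file names are placeholders of mine rather than code from the paper; only the idea of averaging the snapshots' softmax outputs comes from the method described above.

```python
import torch
import torch.nn.functional as F

def snapshot_ensemble_predict(build_model, snapshot_paths, x):
    """Average the softmax outputs of all saved snapshots for a batch x."""
    probs = []
    with torch.no_grad():
        for path in snapshot_paths:
            model = build_model()                    # fresh copy of the architecture
            model.load_state_dict(torch.load(path))  # load one snapshot's weights
            model.eval()
            probs.append(F.softmax(model(x), dim=1))
    # The ensemble prediction is simply the mean of the per-snapshot probabilities.
    return torch.stack(probs).mean(dim=0)

# Hypothetical usage with 6 snapshots saved at the end of each annealing cycle:
# paths = [f"snapshot_{m}.pth" for m in range(6)]
# y_prob = snapshot_ensemble_predict(build_model, paths, x_batch)
```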
Cyclic Cosine Annealing

Instead of manually trying to figure out when to dive into a local minimum or when to jump out of it, the authors used a learning rate schedule to automate the process. They used cosine annealing of the learning rate with the following (simplified) function:

α(t) = (α₀ / 2) · ( cos( π · mod(t − 1, ⌈T/M⌉) / ⌈T/M⌉ ) + 1 )

The formula may look complicated, but it's quite simple: it is a monotonically decreasing function within each cycle. Here α is the new learning rate and α₀ is the initial learning rate. T is the total number of training iterations you want to use (T should equal the number of mini-batches per epoch multiplied by the number of epochs). M is the number of snapshots you want in your ensemble.

Figure 1.1 — M = 6 with a training budget of 300 epochs. The vertical dotted lines indicate a model snapshot. After 300 epochs, a total of 6 models have been added to the ensemble.

Notice how the loss falls rapidly just before each snapshot. This is because the learning rate decreases continuously. After each snapshot, the learning rate is restarted (they used the value 0.1). This brings the gradient path out of the current local minimum, and the search for a new local minimum begins again.

Show me the numbers

I have included the numbers that the authors used to demonstrate the effectiveness of their method.

Figure 1.2 — Error rates (%) on CIFAR-10, CIFAR-100, SVHN and Tiny ImageNet. Blue indicates the authors' work, and bold indicates the best error rate for that category.

Conclusion

This is a useful strategy to get a marginal boost in accuracy at no additional training cost. The paper also discusses how varying parameters such as M and T affects performance.

2. FreezeOut — Training Acceleration by Progressively Freezing Layers

The authors of this paper propose a method to increase training speed by freezing layers. They experiment with a few different ways of freezing the layers, and demonstrate a training speed-up with little (or no) effect on accuracy.

What does Freezing a Layer mean?

Freezing a layer prevents its weights from being modified. This technique is often used in transfer learning, where the base model (trained on some other dataset) is frozen.

How does freezing affect the speed of the model?

If you don't want to modify the weights of a layer, the backward pass to that layer can be completely avoided, resulting in a significant speed boost. For example, if half your model is frozen and you train it, it will take about half the time compared to a fully trainable model. On the other hand, you still need to train the model, so if you freeze layers too early, the model will give inaccurate predictions.

What is the 'novel' approach?

The authors demonstrate a way to freeze the layers one by one, as soon as possible, resulting in fewer and fewer backward passes, which in turn lowers training time. At first, the entire model is trainable (exactly like a regular model). After a few iterations the first layer is frozen, and the rest of the model continues to train. After another few iterations, the next layer is frozen, and so on.

Learning Rate Annealing

The authors used learning rate annealing to govern the learning rate of the model. The notably different technique they used was that they changed the learning rate layer by layer instead of for the whole model. They used the following equation:

Equation 2.0: α_i(t) = (α_i(0) / 2) · ( 1 + cos( π · t / t_i ) ) while t < t_i, and α_i(t) = 0 afterwards. Here α is the learning rate, t is the iteration number, and i denotes the ith layer of the model.

Equation 2.0 Explanation

The subscript i denotes the ith layer, so α_i denotes the learning rate for the ith layer. Similarly, t_i denotes the number of iterations the ith layer is trained for (the iteration at which it is frozen), and t is the current iteration of training.

Equation 2.1: α_i(0), the initial learning rate for the ith layer.
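To illustrate how such a layer-wise schedule could be wired up, here is a minimal sketch in PyTorch, using one optimizer parameter group per layer. The toy layer list, the freeze iterations `t_i` and the initial rates `alpha_0` are illustrative values of mine, not the authors' settings, and the snippet is a simplification rather than the paper's implementation.

```python
import math
import torch
import torch.nn as nn

# Toy model: each entry of the ModuleList is one "layer" with its own schedule.
layers = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
    nn.Linear(64, 10),
])

t_i     = [400, 700, 1000]   # iteration at which layer i is frozen (illustrative)
alpha_0 = [0.1, 0.1, 0.1]    # initial learning rate of each layer (illustrative)

# One optimizer parameter group per layer, so every layer can be annealed on its own.
optimizer = torch.optim.SGD(
    [{"params": layer.parameters(), "lr": lr0} for layer, lr0 in zip(layers, alpha_0)],
    lr=0.1, momentum=0.9,
)

def set_layerwise_lr(t):
    """Anneal each layer's learning rate to zero at its own freeze iteration t_i."""
    for i, group in enumerate(optimizer.param_groups):
        if t < t_i[i]:
            group["lr"] = 0.5 * alpha_0[i] * (1 + math.cos(math.pi * t / t_i[i]))
        else:
            group["lr"] = 0.0
            for p in layers[i].parameters():
                p.requires_grad = False   # frozen: no more gradients for this layer

# In the training loop, call set_layerwise_lr(t) before optimizer.step() at iteration t.
```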
The authors experimented with different values for Equation 2.1, the initial learning rate.

Initial learning rate for Equation 2.1

The authors tried scaling the initial learning rate so that each layer was trained for an equal "amount". Remember that because the first layer of the model is stopped first, it would otherwise be trained for the least amount of time. To remedy that, they scaled the initial learning rate of each layer.

Figure 2.0

The scaling was done to ensure all the layers' weights moved equally far in weight space, i.e. the layers that were trained the longest (the later layers) had a lower initial learning rate. The authors also played with cubic scaling, where the value of t_i is replaced by its own cube.

Figure 2.1: Performance vs. error on DenseNet.

The authors have included more benchmarks; their method achieves a training speed-up of about 20% at only a 3% drop in accuracy, and about 15% with no drop in accuracy. Their method does not work very well for models that do not utilize skip connections (such as VGG-16): neither accuracy nor speed-ups were noticeably different in such networks.

My Bonus Trick

The authors progressively stop each layer from being trained, and then no longer calculate backward passes for it. However, they seem to have missed exploiting the precomputing of layer activations. By doing so, you can even prevent calculating the forward pass through the frozen layers.

What is precomputation

This is a trick used in transfer learning. This is the general workflow:

1. Freeze the layers you don't want to modify
2. Calculate the activations of the last frozen layer (for your entire dataset)
3. Save those activations to disk
4. Use those activations as the input to your trainable layers

Since the layers are frozen progressively, the remaining model can now be seen as a standalone, smaller model that just takes as input whatever the last frozen layer outputs. This can be done over and over again as each layer is frozen. Doing this along with FreezeOut will result in a further substantial reduction in training time while not affecting other metrics (like accuracy) in any way. (A minimal code sketch of this idea is included at the end of this post.)

Conclusion

I demonstrated two (and a half, counting my own) very recent and novel techniques to improve accuracy and lower training time by fine-tuning learning rates. By also adding precomputation wherever possible, a further speed boost should be possible using my proposed addition.

P.S. (Also stands for Please Share)

If you notice any errors or have any doubts, please comment about them. I will update my post or try to explain better. Also, if you liked my article, please recommend it by pressing on the ❤. It lets me know I was of help to you.
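Appendix: a minimal sketch of the precomputation trick described above, assuming the network has already been split into a frozen front part and a trainable tail. The `precompute_activations` helper and the in-memory caching (instead of saving to disk) are simplifications of mine.

```python
import torch

def precompute_activations(frozen_part, dataloader):
    """Run the frozen layers once over the whole dataset and cache their outputs."""
    frozen_part.eval()
    feats, targets = [], []
    with torch.no_grad():                      # no gradients needed for frozen layers
        for x, y in dataloader:
            feats.append(frozen_part(x))
            targets.append(y)
    return torch.cat(feats), torch.cat(targets)

# The trainable tail is now a standalone, smaller model whose input is the cached
# activations -- the frozen forward pass is never repeated in later epochs.
# feats, targets = precompute_activations(frozen_part, train_loader)
# cached_loader = torch.utils.data.DataLoader(
#     torch.utils.data.TensorDataset(feats, targets), batch_size=128, shuffle=True)
```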