Deep neural networks have a very large number of learnable parameters. This often causes two problems: the model may not make very accurate predictions, and it takes a long time to train. This post covers two recent techniques, one for increasing accuracy and one for reducing training time.

#### EDIT:

This article won the second most popular blog post award in August 2017 by KDNuggets.

The original papers can be found here (**Snapshot Ensembles**) and here (**FreezeOut**).

*This article assumes some familiarity with neural networks, including aspects like **SGD**, **minima**, **optimisation**, etc.*

### How this article is structured

I will be talking about two different papers that aim to do different things.

Note that even though there are two different ideas, they are not mutually exclusive and can be used simultaneously.

*This is a long post, but it is divided into two sections that can be read independently of each other.*

### 1. Snapshot Ensembling — M models for the cost of 1

#### Regular Ensemble Models

Ensemble models are a **group of models** that work collectively to produce a prediction. The idea is simple: **train several models** using different hyperparameters, and **average the predictions** from all of them. This technique gives a **great boost in accuracy** because the result does not rely on a single model. Most winning entries in high-profile machine learning competitions have used ensembles.
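
To make the averaging step concrete, here is a minimal sketch in plain Python. The three probability vectors are made-up numbers for a single input, and `average_ensemble` and `argmax` are helper names I chose for illustration, not anything from the paper.

```python
def average_ensemble(prob_lists):
    """Average the class probabilities predicted by several models for one input."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical softmax outputs of three models for the same input (3 classes).
model_outputs = [
    [0.50, 0.40, 0.10],  # this model alone would pick class 0
    [0.20, 0.70, 0.10],
    [0.30, 0.60, 0.10],
]

averaged = average_ensemble(model_outputs)
predicted_class = argmax(averaged)  # the ensemble picks class 1
```

Here the first model on its own would choose class 0, but the average of all three favours class 1, which is the kind of single-model mistake an ensemble can smooth over.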

#### So what’s the problem?

Training N different models takes N times as long as training a single model. Most people who don't have the luxury of multiple GPUs have to wait a long time before they can test these models, which makes experimenting much slower.

#### SGD Mechanics

Before I tell you about the ‘novel’ approach, you must first understand the nature of Stochastic Gradient Descent (SGD). SGD is greedy: it looks for the steepest descent. However, there is one very crucial **parameter** that **governs** SGD: **the learning rate**.

If the **learning rate** is too **high**, **SGD** will **ignore** very **narrow crevices** (minima) and take large steps (think of a tank not being affected by a pothole on the road).

On the other hand, if the **learning rate** is **small**, **SGD** will **fall** into one of these **local minima** and not be able to get out. It is, however, possible to **bring SGD back** out of a local minimum by **increasing** the **learning rate**.

#### The trick?

The authors of the paper use this **controllable property** of SGD falling in and climbing out of local minima. **Different local minima** may have very **similar error rates**, but the **mistakes** that they will make will be **different** from **each other**.

They have included a very useful diagram that explains this concept:

#### What is being ensembled (a.k.a. the snapshots)?

The authors use the fact that different local minima have different ‘viewpoints’ on their predictions to create multiple models. Every time SGD reaches a **local minimum**, a **snapshot** of that model is **saved**, and it becomes part of the final ensemble of networks.

**Cyclic Cosine Annealing**

Instead of manually figuring out when to dive into a local minimum and when to jump out of it, the authors used a function to automate this process.

They used Learning Rate Annealing with the following function:

### Simplified

The formula may look complicated, but it's quite simple. Within each cycle, the learning rate follows a monotonically **decreasing function**. *α* here is the new learning rate, and α0 is the initial learning rate. **T** is the total **number** of training **iterations** you want to use (T should be equal to the number of batches per epoch × the number of epochs). **M** is the **number** of **snapshots** you want in your ensemble.

Notice how the loss falls rapidly just before each snapshot. This is because the learning rate **decreases continuously** within a cycle. After each snapshot, the learning rate is **restarted** at its initial value (they used 0.1). This pushes the optimisation path out of the current local minimum, and the search for a new local minimum begins.
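
Here is a small sketch of that schedule in plain Python. As I understand the paper, the learning rate at iteration t (counting from 1) is `α(t) = (α0 / 2) · (cos(π · mod(t − 1, ⌈T/M⌉) / ⌈T/M⌉) + 1)`; please check the paper before relying on the exact form. The function names and the values T = 300, M = 3 are my own choices for illustration.

```python
import math


def snapshot_lr(t, total_iters, n_snapshots, lr_max=0.1):
    """Cyclic cosine-annealed learning rate at iteration t (1-indexed)."""
    cycle_len = math.ceil(total_iters / n_snapshots)
    pos = (t - 1) % cycle_len  # position inside the current cycle
    return lr_max / 2 * (math.cos(math.pi * pos / cycle_len) + 1)


def is_snapshot_step(t, total_iters, n_snapshots):
    """True at the last iteration of each cycle, where a snapshot would be saved."""
    cycle_len = math.ceil(total_iters / n_snapshots)
    return t % cycle_len == 0 or t == total_iters


total_iters = 300   # T: batches per epoch * number of epochs (hypothetical value)
n_snapshots = 3     # M: number of models in the final ensemble

schedule = [snapshot_lr(t, total_iters, n_snapshots) for t in range(1, total_iters + 1)]
snapshot_steps = [t for t in range(1, total_iters + 1)
                  if is_snapshot_step(t, total_iters, n_snapshots)]
```

With these values each cycle lasts 100 iterations: the rate starts at 0.1, decays to almost zero by iteration 100, and jumps back to 0.1 at iteration 101. In a real training loop you would save the model weights at each step in `snapshot_steps`.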

#### Show me the numbers

I have included the numbers that the authors used to demonstrate the effectiveness of their method

#### Conclusion

This is a useful strategy to get a marginal boost in accuracy at no additional training cost. The paper talks about varying different parameters such as M and T, and how they affect performance.

### 2. FreezeOut — Training Acceleration by Progressively Freezing Layers

The authors of this paper propose a method to increase training speed by freezing layers. They experiment with a few different ways of freezing the layers, and demonstrate the training speedup with little (or no) effect on accuracy.

#### What does Freezing a Layer mean?

Freezing a layer prevents its weights from being modified. This technique is often used in **transfer learning**, where the base model (trained on some other dataset) is frozen.

**How does freezing affect the speed of the model?**

If you don't want to modify the weights of a layer, the **backward pass** to that layer can be **completely avoided**, resulting in a significant **speed boost**. For example, if the first half of your model is frozen, the backward pass only has to cover the second half, so it costs roughly half as much as in a fully trainable model (the forward pass still runs through every layer).
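
A toy way to see which layers the backward pass still has to visit, written in plain Python. This is only a counting sketch under my own assumptions (a simple chain of layers, with a hypothetical helper `layers_needing_backward`); it is not how any particular framework implements it.

```python
def layers_needing_backward(trainable):
    """Indices of layers the backward pass must visit, from last layer to first.

    Backprop runs from the output toward the input and can stop as soon as
    every remaining earlier layer is frozen. A frozen layer that sits *after*
    a trainable one must still pass gradients through to it.
    """
    if not any(trainable):
        return []
    first_trainable = trainable.index(True)
    return list(range(len(trainable) - 1, first_trainable - 1, -1))


fully_trainable = layers_needing_backward([True, True, True, True])
half_frozen = layers_needing_backward([False, False, True, True])
frozen_in_middle = layers_needing_backward([True, False, True])
```

With the first two of four layers frozen, only layers 3 and 2 are visited. In the last case nothing is saved on the traversal itself, because the gradient has to flow through the frozen middle layer to reach layer 0.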

On the other hand, you **still** **need** to **train** the **model**, so if you freeze it too **early**, it will give **inaccurate** **predictions**.

#### What is the ‘novel’ approach?

The authors demonstrated a **way** to **freeze** the layers one by one **as soon as possible**, resulting in fewer and fewer backward passes, which in turn lowers training time.

At first, the entire model is trainable (exactly like a regular model). After a few iterations, the first layer is frozen and the rest of the model continues to train. After another few iterations, the next layer is frozen, and so on.
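
The progressive schedule can be sketched like this. I am assuming that the first layer is frozen at some user-chosen fraction of training (0.5 here) and that the remaining layers are frozen at evenly spaced points up to the end of training; `freeze_iterations`, `trainable_mask`, and the numbers are illustrative names and values of mine, so check the paper for the exact scheme.

```python
def freeze_iterations(n_layers, total_iters, t0_frac=0.5):
    """Iteration at which each layer is frozen, linearly spaced from
    t0_frac * total_iters (first layer) to total_iters (last layer)."""
    if n_layers == 1:
        return [total_iters]
    step = (1.0 - t0_frac) / (n_layers - 1)
    return [round((t0_frac + i * step) * total_iters) for i in range(n_layers)]


def trainable_mask(t, freeze_at):
    """Which layers are still trainable at iteration t."""
    return [t < t_i for t_i in freeze_at]


freeze_at = freeze_iterations(n_layers=5, total_iters=1000)
mask_early = trainable_mask(100, freeze_at)   # everything still trains
mask_later = trainable_mask(600, freeze_at)   # the first layer is already frozen
```

With five layers and 1000 iterations this freezes the layers at iterations 500, 625, 750, 875, and 1000, so fewer and fewer layers need a backward pass as training goes on.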

**Learning Rate Annealing**

The authors used learning rate annealing to govern the learning rate of the model. The **notably different technique** they used was that they **changed** the learning rate **layer by layer** instead of the whole model. They used the following equation:

#### Equation 2.0 Explanation

The subscript *i* denotes the ith layer, so *αᵢ* is the learning rate for the ith layer. Similarly, *tᵢ* denotes the number of iterations the ith layer is trained for before it is frozen. *t* denotes the number of iterations the whole model has been trained for so far.

*αᵢ(0)* denotes the initial learning rate for the ith layer.
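
In code, the per-layer schedule looks roughly like the sketch below. I am assuming the form `αᵢ(t) = 0.5 · αᵢ(0) · (1 + cos(π · t / tᵢ))`, with the rate held at zero once the layer is frozen; this is my reading of the paper, and `layer_lr` is a name I made up for illustration.

```python
import math


def layer_lr(t, t_i, lr_init):
    """Cosine-annealed learning rate of one layer at iteration t.

    t_i is the iteration at which this layer is frozen; from then on the
    learning rate is zero and the layer no longer needs a backward pass.
    """
    if t >= t_i:
        return 0.0
    return 0.5 * lr_init * (1 + math.cos(math.pi * t / t_i))


# A layer frozen at iteration 500 with an initial learning rate of 0.1.
lr_start = layer_lr(0, 500, 0.1)
lr_halfway = layer_lr(250, 500, 0.1)
lr_frozen = layer_lr(500, 500, 0.1)
```

The rate starts at the layer's initial value, is about half of it midway to the freeze point, and reaches zero exactly when the layer is frozen.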

The authors experimented with different values for Equation 2.1.

#### Initial learning rate for Equation 2.1

The authors tried scaling the initial learning rate of each layer to compensate for the different amounts of training time each layer receives.

Remember that because the first layer of the model is stopped first, it would otherwise be trained for the least amount of time. To remedy that, they scaled the learning rate for each layer.

The scaling was done to ensure all the layers’ weights moved equally in the weight space, i.e. the layers that were trained the longest (the later layers) had a lower learning rate.

The authors also played with **cubic scaling**, where the value of *tᵢ* is replaced by its own cube.
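
A quick numeric sketch of both ideas. I am assuming the scaled initial rate is the base rate divided by *tᵢ*, with *tᵢ* expressed as a fraction of total training, and that the layer follows a cosine schedule down to zero at *tᵢ*. Under that assumption the area under the layer's learning rate curve is `0.5 · αᵢ(0) · tᵢ`, which comes out the same for every layer. The freeze fractions below are made up.

```python
def scaled_initial_lr(base_lr, t_frac):
    """Initial learning rate for a layer frozen at fraction t_frac of training."""
    return base_lr / t_frac


def cosine_area(lr_init, t_frac):
    """Area under 0.5 * lr_init * (1 + cos(pi * t / t_frac)) for t in [0, t_frac]."""
    return 0.5 * lr_init * t_frac


base_lr = 0.1
linear_fracs = [0.5, 0.75, 1.0]                # hypothetical freeze points, 3 layers
cubic_fracs = [t ** 3 for t in linear_fracs]   # cubic variant: each t_i is cubed

scaled_lrs = [scaled_initial_lr(base_lr, t) for t in linear_fracs]
areas = [cosine_area(lr, t) for lr, t in zip(scaled_lrs, linear_fracs)]
```

The earliest-frozen layer gets the highest initial rate (about 0.2 here) and the last layer gets the base rate, while every entry of `areas` is about 0.05. Cubing the fractions moves the freeze points earlier (0.5 becomes 0.125), so layers stop training sooner.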

The authors have included more benchmarks. Their method achieves a training speedup of about **20%** with only a **3%** drop in **accuracy**, and **15%** with **no drop** in accuracy.

Their method does not work very well for models that do not use skip connections (such as VGG-16). Neither accuracy nor speedup was noticeably different in such networks.

### My Bonus Trick

The authors progressively stop each layer from being trained and then skip the backward pass for it. They seem to have **missed** the chance to exploit **precomputing layer activations**. By doing so, you can also **avoid** calculating the **forward pass** for the frozen layers.

**What is precomputation?**

This is a trick used in transfer learning. The general workflow is:

- Freeze the layers you don’t want to modify
- Calculate the activations of the last frozen layer (for your entire dataset)
- Save those activations to disk
- Use those activations as the input of your trainable layers
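
The steps above can be sketched with a toy example in plain Python. `CountingLayer` and `precompute` are hypothetical stand-ins I wrote for illustration, and the cache is kept in memory instead of being written to disk to keep the sketch short.

```python
class CountingLayer:
    """Toy frozen layer that scales its input and counts its forward passes."""

    def __init__(self, scale):
        self.scale = scale
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return [self.scale * v for v in x]


def precompute(frozen_layers, dataset):
    """Run every sample through the frozen layers once and cache the result."""
    cached = []
    for x in dataset:
        for layer in frozen_layers:
            x = layer.forward(x)
        cached.append(x)
    return cached


dataset = [[1.0, 2.0], [3.0, 4.0]]               # made-up inputs
frozen = [CountingLayer(2.0), CountingLayer(0.5)]
cached_activations = precompute(frozen, dataset)

# Later epochs read the cache; the frozen layers are never called again.
n_epochs = 3
for _ in range(n_epochs):
    for activation in cached_activations:
        pass  # feed `activation` into the trainable layers here
```

Each frozen layer runs exactly once per sample no matter how many epochs follow, which is where the saving on the forward pass comes from.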

Since the layers are frozen progressively, the remaining trainable layers can be seen as a standalone (smaller) model that takes the output of the last frozen layer as its input. This can be done over and over again as each layer is frozen.

Doing this along with **FreezeOut** should result in a further substantial reduction in training time while not affecting other metrics (like accuracy).

### Conclusion

I covered two (and a half, counting my own) very recent and novel techniques to improve accuracy and lower training time by fine-tuning learning rates. Adding precomputation wherever possible, as in my proposed method, could give a further significant speed boost.

### P.S. (Also stands for Please Share)

If you notice any errors or have any doubts, please comment about them. I will update my post or try to explain better.

Also, if you liked my article, please recommend it by pressing on the ❤. It lets me know I was of help to you.