April 7th 2019

Neural Networks that represent a supervised learning method, requires a large training set of complete records, including the target variable. Training a deep neural network to find the best parameters of that network is an iterative process, but training deep neural networks on a large data set iteratively is very slow. So what we need is that by having a good optimization algorithm to update the parameters (weights and biases) of the network can speed up the learning process of the network. The choice of optimization algorithms in deep learning can influence the network training speed and its performance.

In this article, we will discuss the need for improving the gradient descent optimization technique and we will also discuss the different variants of gradient descent optimization algorithm.

Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs — Padhai

Rest of the article is structured as follows,

**Gradient Descent****Momentum Gradient Descent****Nesterov Accelerated Gradient Descent****Mini-Batch & Stochastic Gradient Descent****Adaptive Gradient Descent — AdaGrad****Root Mean Square Propagation Gradient Descent — RMSProp****Adaptive Moment Estimation — Adam**

Gradient Descent is one of the most commonly used optimization techniques to optimize neural networks. Gradient descent algorithm updates the parameters by moving in the direction opposite to the gradient of the objective function with respect to the network parameters. Gradient descent algorithm for single sigmoid neuron works like this,

Parameter update rule will be given by,

- Initialize the parameters randomly
**w**and**b**and iterate over all the observations in the data. - Calculate the value of the gradient for each parameter (i.e. In which direction we need to move such that loss is reduced).
- Update the value of each parameter based on its gradient value.
- Continue doing step 2 and 3 till loss function gets minimized.

Can we come up with a better update rule?

To understand the gradient descent update rule in a better way, I have taken some toy data and iterated over all the data points for 1000 epochs and computed loss for different values of **w **and **b. **Once I got the loss values for all possible combinations of **w** and **b**, I was able to generate an animation that shows the gradient descent rule in action.

The darker shade of red color in the graph indicates the area where the loss value is a high and darker shade of blue in the valley indicates the lowest point of loss. The points at the bottom indicate the different combinations of **w **&** b** (parameters) and the points on the contour indicate the loss value for the corresponding parameter values.

From the gradient descent loss animation, you would have observed that for initial iterations while the curve is still on the flat light red surface, the **w**and **b** values at the bottom of the contour plot change very little for every epoch. It means that we are taking very smaller steps towards our goal but once our loss curve enters into the valley (light blue area) for every epoch the values of **w** and **b **are changing drastically (black bots are far from each other)

In short, in the gentle areas of the loss surface, the movement is very slow and in the steep areas the movement is fast and again at the gentle areas at the bottom, the movement is very slow.

We have made a pretty interesting observation about the gradient descent update rule but we don’t know why the movement is slow in some parts and fast in some parts of the contour surface.

Let’s take a simple function shown below,

As you can see from the figure, the curve is a combination of very steep region and gentle region. The slope (derivative) of the curve in the steep region evaluates to a value of 3 (change in y = 3, change in x = 1) similarly, the slope of the curve in the gentle region evaluates to a value of 1. So we infer that when the slope is very steep the derivative(gradient) is high and when the slope is gentle the derivative is low.

The derivatives at the steep slopes are larger in magnitude, whereas for the gentle slopes they are smaller in magnitude. Now relating this information to the gradient descent animation, when we are at the plateau with a gentle slope the derivatives at that area will be small. If the derivatives are small then the update of the parameter will also be small.

Hence in the areas where the curve is gentle, the updates are small whereas in the areas where the curve is steep the updates are large

We have figured out the reason why the gradient descent update is moving slowly in some regions and faster in some regions of the contour. what is the problem with this kind of movement in the update rule?

The problem is that in gradient descent, we are initializing parameters randomly and if that so happens that we initialize **w** and **b** in a place where the surface is very gentle then we need to run many many epochs just to come out that flat surface and enter into the slightly steep region from there you will start seeing some improvements in loss function. In order to avoid this kind of scenarios, we need to improve our update rule.

Before we proceed to see some improvements to the gradient descent update rule first we will take a small detour and understand how to interpret contour maps. Contour maps help us to visualize 3D data in two dimensions using contours or color-coded regions. Contour maps tell about the movement of the slope along the particular direction based on the distance between the corresponding contours in the 2D contour map.

A small distance between the contours indicates a steep slope along that direction.A large distance between the contours indicates a gentle slope along that direction.

The 3D representation of gradient descent error surface looks like this,

We can notice that from the above figure as the curve reaches the steeper surface it takes larger and larger steps towards convergence represented with sparse points. Once the curve reaches the dark blue flat region the slope in that direction is gentle that means the distance between the consecutive contours in the dark blue region would be less.

As you can see from the animation once the curve reaches the cluster of light blue contours where the distance between consecutive contours is less it takes larger and larger steps towards convergence.

From our discussion on gradient descent, it is clear it takes a lot of time to navigate regions having a gentle slope because the gradient in these regions is very small. How can we overcome this?

Let’s consider a situation where you are going to a recently opened shopping mall in an unknown area. While you were trying to locate the mall you have asked multiple people about the location of the mall and everyone directed you to go towards the same location. Because everyone is pointing you towards the same direction you will go faster and faster in that direction with more confidence. Now we will use the same intuition in momentum based gradient descent,

In Momentum based gradient descent update rule, we also included the history component vₜ it stores all the previous gradient movements till this time t.

At every time step or every update instead of just moving by the value of the current gradient like the Vanilla GD. In Momentum GD, we are moving with an exponential decaying cumulative average of previous gradients and current gradient. Now let’s see how Momentum GD would perform on the same error surface,

From the animation, it is evident that Momentum based gradient descent is moving faster than the vanilla GD on the flat surface because it is not only using the current gradient value to update the parameters but also a history of gradients value till that point. By the time it has taken around 5 to 6 steps on the flat surface, it’s history would have grown up to a certain extent, allowing the Momentum gradient descent to take larger and larger steps on the flat surface.

Is moving fast always good?

To analyze the limitations of Momentum based gradient descent on a different error surface, I have taken some other toy data set and used single sigmoid neuron to find the loss value at different combinations of **w** and **b **to generate 3D loss surface which is having narrower minima surface.

Momentum-based gradient descent oscillates in and out of minima because it would have accumulated more history by the time it reaches minima resulting in taking larger and larger steps evidently leads to overshooting the objective. Despite the u-turns, it still converges faster than the vanilla gradient descent.

How can we avoid the overshooting of minima?.

In Momentum based gradient descent we are making one movement in the direction of gradient history and another in the direction of the current gradient because of this two-step movement we are overshooting the minima. Instead of moving two steps at a time, why not first move a little in the direction of gradient history, compute the gradient at that point and update the parameters.

In essence, what we are doing in Nesterov Accelerated Gradient Descent is that we are looking forward to see whether we are close to the minima or not before we take another step based on the current gradient value so that we can avoid the problem of overshooting.

We can see that all the oscillations made by the Nesterov Accelerated Gradient are much smaller than that of Momentum based Gradient Descent. Looking ahead helps NAG in correcting its course quicker than Momentum based Gradient Descent. Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller.

Should we use entire data for computing the gradients?

Consider that you have a huge dataset with million data points if we use the batch gradient, the algorithm will take a million calculations (compute derivates for each of the million points and accumulate all these derivates) and then make one tiny update to our parameters.

Can we do Something better?

Instead of looking at all data points at one go, we will divide the entire data into a number of subsets. For each subset of data, compute the derivates for each of the point present in the subset and make an update to the parameters. Instead of calculating the derivative for entire data with respect to the loss function, we have approximated it to fewer points or smaller batch size. This method of computing gradients in batches is called the **Mini-Batch Gradient Descent**. If we have one million data points with a batch size of 10, we will have 100,000 batches and we would have made 100,000 updates to the parameters.

If we take the idea of mini batch Gradient descent to the extreme and look one point at a time, compute the derivative and make an update to the parameters for every point. This method is called the **Stochastic Gradient Descent**.

In Stochastic we update the parameters for each point, the points are not working in consensus. Every point is making the greedy decision and updates the parameters best suitable for that point alone which may be the case for the remaining points. We can reduce the problem of oscillations is by considering a subset of data or batch of data and update the parameters after calculating the derivates for each point within that batch.

We can see the oscillations have reduced in Mini-Batch gradient descent and oscillations are completely contained within the blue curve. In practice, we choose the batch size equal to 16, 32 and 64. Whether we are using Mini-Batch or Stochastic gradient descent key point to note is that we are approximating the derivatives with respect to loss function, we are not computing the true derivatives of the loss function.

Number of updates (steps) made to the parameters in one epoch,

- Batch Gradient Descent — 1
- Stochastic Gradient Descent — N(Number of data points)
- Mini-Batch Gradient Descent — N/B(B = Batch Size)

AdaGrad — Gradient Descent with Adaptive Learning Rate

The main motivation behind the AdaGrad was the idea of Adaptive Learning rate for different features in the dataset, i.e. instead of using the same learning rate across all the features in the dataset, we might need different learning rate for different features.

Why do we need Adaptive Learning rate?

Consider a dataset that has a very important but sparse variable, if that variable is zero in most of the training data points the derivative proportional to that variable will also be equal to zero. If the derivative is equal to zero then the weight update is going to be zero.

If our parameters (weights) are not moving towards the minima then the model will not make optimal predictions. To aid such sparse features we want to make sure that whenever that feature value is not zero whatever is the derivative at that point it should get boosted by a larger learning rate.

If a particular parameter is getting updated frequently that means the derivative is not zero most of the times, for such parameters we want to have a smaller learning rate in contrast if we have a sparse parameter which will be off (zero) most of the times that mean the derivative will be zero most of the times. For such feature whenever this feature is on (not zero) we want to boost the gradient update with a higher learning rate.

In AdaGrad, we are dividing the learning with the history of gradient value until that point. Non-sparse features will have a large history value because they would be getting frequent updates, by dividing the learning rate with the large history, the effective learning rate would be very small. In the case of sparse features, gradient history value would be very less leading to large effective learning rate.

Consider that we have data with two features **w **(sparse feature)** **and **b**which means **w** undergoes fewer updates. Now we will compare AdaGrad update rule with Vanilla GD (black curve), Momentum GD (red), NAG (blue) and AdaGrad (green)

From the above plot, we can see that AdaGrad is taking larger steps in the direction of **w **even though it is a sparse feature in contrast to the other variants because of its gradient value is boosted by the learning rate. The disadvantage of AdaGrad is that the learning rate decays very aggressively as the denominator grows which is not good for parameters corresponding to dense features. As **b** gets updated, again and again, denominator increases a lot and the effective learning rate becomes close to zero, as a result, AdaGrad is no longer able to move in the direction of **b** and gets stuck close to the convergence.

RMSProp — Root Mean Square Propagation

To prevent the denominator from growing rapidly why not decay the denominator and prevent its rapid growth?

RMSProp uses this intuition to prevent the rapid growth of the denominator for dense variables so the effective learning rate doesn’t become close to zero.

In RMSProp history of gradients is calculated using an expoential decaying average unlike the sum of gradients in AdaGrad, which helps to prevent the rapid growth of the denominator for dense features.

Since the denominator for dense features ‘**b**’ is not decaying as aggressively as AdaGrad, RMSProp is able to move in the direction of **b** eventually leading to convergence.

The name Adam is derived from adaptive moment estimation

In Momentum based gradient descent we are using the cumulative history of gradients to move faster in the gentle surfaces and we have seen RMSProp which is also using history to decay the denominator and prevent its rapid growth. The way these algorithms uses the history is different, In Momentum GD, we use history to compute the current update whereas in case of RMSProp history was used to adjust the learning rate (shrink or boost).

Adam combines these two separate histories into one algorithm.

Adam maintains two histories, ‘mₜ’ similar to the history used in Momentum GD and ‘vₜ’ similar to the history used in RMSProp. In practice, Adam does something known as bias correction. It uses the following equations for ‘mₜ’ and ‘vₜ’,

Bias correction ensures that at the beginning of the training updates don’t behave in a weird manner. The key point in Adam is that it combines the advantages of Momentum GD (moving faster in gentle regions) and RMSProp GD (adjusting learning rate).

Because Adam has a property of Momentum GD, we can see that it overshoots the minima and comes back, makes few u-turns before convergence.

In this post, we have looked at the batch gradient descent, the need to develop new optimization techniques, and then we briefly discussed how to interpret contour plots. After that, we have looked at behind six different optimization techniques and three different data strategies (batch, mini-batch & stochastic) with an intuitive understanding which helps to know where to use any of these algorithms. In Practice Adam optimizer with mini-batch of sizes 32, 64 and 128 is the default choice, at least for all the image classification tasks which deal with CNN and large sequence to sequence models.

Backpropagation is the backbone of how neural networks learn what they learn. If you are interested in learning more about Neural Networks, check out the Artificial Neural Networks by Abhishek and Pukhraj from Starttechacademy. This course will be taught using the latest version of Tensorflow 2.0 (Keras backend).