Learn how TensorFlow or PyTorch implement optimization algorithms by re-creating them in NumPy, and create animations of them using Matplotlib.

In this post, we will implement different variants of the gradient descent optimization technique and visualize how the update rule works for each variant using Matplotlib. This is a follow-up to my previous post on optimization algorithms:

https://hackernoon.com/demystifying-different-variants-of-gradient-descent-optimization-algorithm-19ae9ba2e9bc

Citation Note: The content and the structure of this article are based on the deep learning lectures from One-Fourth Labs (PadhAI).

Gradient descent is one of the most commonly used techniques for optimizing neural networks. The algorithm updates the parameters by moving in the direction opposite to the gradient of the objective function with respect to the network parameters.

Implementation In Python Using Numpy

In the coding part, we will cover the following topics:

- Sigmoid Neuron Class
- Overall setup: what the data, model, and task are
- Plotting functions: 3D & contour plots
- Individual algorithms and how they perform

The individual algorithms include:

- Batch Gradient Descent
- Momentum GD
- Nesterov Accelerated GD
- Mini Batch and Stochastic GD
- AdaGrad GD
- RMSProp GD
- Adam GD

If you want to skip the theory part and get into the code right away:

https://github.com/Niranjankumar-c/GradientDescent_Implementation?source=post_page-----809e7ab3bab4----------------------

Before we start implementing gradient descent, we need to import the required libraries. We import Axes3D from mpl_toolkits.mplot3d, which provides some basic 3D plotting tools (scatter, surf, line, mesh). It is not the fastest or most feature-complete 3D library out there, but it ships with Matplotlib. We also import colors and the colormap module (cm) from Matplotlib. We would like animated plots to demonstrate how each optimization algorithm works, so we import animation and rc to make the graphs look good. To render content in-line in a Jupyter notebook, we import HTML. Finally, we import numpy, which does most of the heavy lifting for the computation.

    from mpl_toolkits.mplot3d import Axes3D
    import matplotlib.pyplot as plt
    from matplotlib import cm
    import matplotlib.colors
    from matplotlib import animation, rc
    from IPython.display import HTML
    import numpy as np

Sigmoid Neuron Implementation

To implement the gradient descent optimization technique, we will take the simple case of a sigmoid neuron (logistic function) and see how different variants of gradient descent learn the parameters 'w' and 'b'.

Sigmoid Neuron Recap

A sigmoid neuron is similar to the perceptron: every input xi has a weight wi associated with it. The weights indicate the importance of the input in the decision-making process. The output of the sigmoid neuron is not 0 or 1 as in the perceptron model; instead it is a real value between 0 and 1, which can be interpreted as a probability. The most commonly used sigmoid function is the logistic function, which has a characteristic "S"-shaped curve.

[Figure: Sigmoid Neuron Representation (logistic function)]

Learning Algorithm

The objective of the learning algorithm is to determine the best possible values for the parameters (w and b), such that the overall loss (squared error loss) of the model is minimized as much as possible.

We initialize w and b randomly. We then iterate over all the observations in the data; for each observation, we find the corresponding predicted outcome using the sigmoid function and compute the squared error loss. Based on the loss value, we update the weights such that the overall loss of the model at the new parameters is less than the current loss of the model.
To understand the math behind the gradient descent optimization technique, kindly go through my previous article on the sigmoid neuron learning algorithm, linked at the end of this article.

Sigmoid Neuron Class

Before we start analyzing the different variants of the gradient descent algorithm, we will build our model inside a class called SN.

    class SN:
        # constructor
        def __init__(self, w_init, b_init, algo):
            self.w = w_init
            self.b = b_init
            self.w_h = []
            self.b_h = []
            self.e_h = []
            self.algo = algo

        # logistic function
        def sigmoid(self, x, w=None, b=None):
            if w is None:
                w = self.w
            if b is None:
                b = self.b
            return 1. / (1. + np.exp(-(w*x + b)))

        # loss function
        def error(self, X, Y, w=None, b=None):
            if w is None:
                w = self.w
            if b is None:
                b = self.b
            err = 0
            for x, y in zip(X, Y):
                err += 0.5 * (self.sigmoid(x, w, b) - y) ** 2
            return err

        def grad_w(self, x, y, w=None, b=None):
            if w is None:
                w = self.w
            if b is None:
                b = self.b
            y_pred = self.sigmoid(x, w, b)
            return (y_pred - y) * y_pred * (1 - y_pred) * x

        def grad_b(self, x, y, w=None, b=None):
            if w is None:
                w = self.w
            if b is None:
                b = self.b
            y_pred = self.sigmoid(x, w, b)
            return (y_pred - y) * y_pred * (1 - y_pred)

        def fit(self, X, Y, epochs=100, eta=0.01, gamma=0.9, mini_batch_size=100,
                eps=1e-8, beta=0.9, beta1=0.9, beta2=0.9):
            self.w_h = []
            self.b_h = []
            self.e_h = []
            self.X = X
            self.Y = Y

            if self.algo == 'GD':
                for i in range(epochs):
                    dw, db = 0, 0
                    for x, y in zip(X, Y):
                        dw += self.grad_w(x, y)
                        db += self.grad_b(x, y)
                    self.w -= eta * dw / X.shape[0]
                    self.b -= eta * db / X.shape[0]
                    self.append_log()

            elif self.algo == 'MiniBatch':
                for i in range(epochs):
                    dw, db = 0, 0
                    points_seen = 0
                    for x, y in zip(X, Y):
                        dw += self.grad_w(x, y)
                        db += self.grad_b(x, y)
                        points_seen += 1
                        if points_seen % mini_batch_size == 0:
                            self.w -= eta * dw / mini_batch_size
                            self.b -= eta * db / mini_batch_size
                            self.append_log()
                            dw, db = 0, 0

            elif self.algo == 'Momentum':
                v_w, v_b = 0, 0
                for i in range(epochs):
                    dw, db = 0, 0
                    for x, y in zip(X, Y):
                        dw += self.grad_w(x, y)
                        db += self.grad_b(x, y)
                    v_w = gamma * v_w + eta * dw
                    v_b = gamma * v_b + eta * db
                    self.w = self.w - v_w
                    self.b = self.b - v_b
                    self.append_log()

            elif self.algo == 'NAG':
                v_w, v_b = 0, 0
                for i in range(epochs):
                    dw, db = 0, 0
                    v_w = gamma * v_w
                    v_b = gamma * v_b
                    for x, y in zip(X, Y):
                        dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)
                        db += self.grad_b(x, y, self.w - v_w, self.b - v_b)
                    v_w = v_w + eta * dw
                    v_b = v_b + eta * db
                    self.w = self.w - v_w
                    self.b = self.b - v_b
                    self.append_log()

            # the AdaGrad, RMSProp and Adam branches are discussed later in the article

        # logging
        def append_log(self):
            self.w_h.append(self.w)
            self.b_h.append(self.b)
            self.e_h.append(self.error(self.X, self.Y))

Let us walk through the class piece by piece, starting with the constructor.

    # constructor
    def __init__(self, w_init, b_init, algo):
        self.w = w_init
        self.b = b_init
        self.w_h = []
        self.b_h = []
        self.e_h = []
        self.algo = algo

The __init__ function (the constructor) initializes the parameters of the sigmoid neuron: the weight w and the bias b.
The function takes three arguments:

- w_init, b_init: the initial values for the parameters 'w' and 'b'. Instead of setting the parameters randomly, we set them to specific values. This allows us to understand how an algorithm performs by visualizing it from different starting points. Some algorithms get stuck in local minima for certain starting values.
- algo: tells the class which variant of the gradient descent algorithm to use for finding the optimal parameters.

In this function, we initialize the parameters and define three new list variables suffixed with '_h', denoting that they are history variables. They keep track of how the weight (w_h), bias (b_h) and error (e_h) values change as the sigmoid neuron learns the parameters.

    def sigmoid(self, x, w=None, b=None):
        if w is None:
            w = self.w
        if b is None:
            b = self.b
        return 1. / (1. + np.exp(-(w*x + b)))

The sigmoid function takes the input x as a mandatory argument and computes the logistic function for that input using the parameters. It also takes two optional arguments:

- w & b: by accepting 'w' and 'b' as arguments, the function lets us evaluate the sigmoid at explicitly specified parameter values. If these arguments are not passed, it uses the learned parameter values to compute the logistic function.

    def error(self, X, Y, w=None, b=None):
        if w is None:
            w = self.w
        if b is None:
            b = self.b
        err = 0
        for x, y in zip(X, Y):
            err += 0.5 * (self.sigmoid(x, w, b) - y) ** 2
        return err

Next, we have the error function, which takes the inputs X and Y as mandatory arguments and the same optional parameter arguments as the sigmoid function. In this function, we iterate through each data point and accumulate the squared error (scaled by 0.5) between the actual output and the output predicted by the sigmoid function. Note that the result is a sum over the data points, not a mean. As with the sigmoid function, it supports calculating the error at specified parameter values.
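The sigmoid and error definitions above imply, by the chain rule, a per-point gradient of (y_pred - y) * y_pred * (1 - y_pred) * x for w, and the same expression without the factor x for b. The sketch below is an illustrative sanity check, not part of the original article's code: it re-implements the loss as standalone functions (the names analytic_grad and numeric_grad are made up for this example) and compares the analytic gradient against a central finite difference on a two-point toy data set.

```python
import numpy as np

def sigmoid(x, w, b):
    """Logistic function of the pre-activation w*x + b."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def error(X, Y, w, b):
    """Summed half squared error over all data points."""
    return float(np.sum(0.5 * (sigmoid(X, w, b) - Y) ** 2))

def analytic_grad(X, Y, w, b):
    """Gradient of the loss w.r.t. w and b, obtained via the chain rule."""
    y_pred = sigmoid(X, w, b)
    common = (y_pred - Y) * y_pred * (1.0 - y_pred)
    return float(np.sum(common * X)), float(np.sum(common))

def numeric_grad(X, Y, w, b, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    dw = (error(X, Y, w + h, b) - error(X, Y, w - h, b)) / (2 * h)
    db = (error(X, Y, w, b + h) - error(X, Y, w, b - h)) / (2 * h)
    return dw, db

# Illustrative toy data and starting point (assumed values for this check)
X = np.asarray([0.5, 2.5])
Y = np.asarray([0.2, 0.9])
grad_analytic = analytic_grad(X, Y, w=-2.0, b=-2.0)
grad_numeric = numeric_grad(X, Y, w=-2.0, b=-2.0)
print(grad_analytic, grad_numeric)
```

If the two tuples agree to several decimal places, the gradient formula used for the updates is consistent with the loss being minimized.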
    def grad_w(self, x, y, w=None, b=None):
        .....

    def grad_b(self, x, y, w=None, b=None):
        .....

Next, we define two functions, grad_w and grad_b, which take the input 'x' and the output 'y' as mandatory arguments and compute the gradient of the loss with respect to the parameters 'w' and 'b' respectively. Again, we have two optional arguments that allow us to compute the gradient at specified parameter values.

    def fit(self, X, Y, epochs=100, eta=0.01, gamma=0.9, mini_batch_size=100, eps=1e-8, beta=0.9, beta1=0.9, beta2=0.9):
        self.w_h = []
        .......

Next, we define the 'fit' method, which takes the inputs 'X' and 'Y' and a number of other parameters. I will explain each of these parameters when it is used by a particular variant of the gradient descent algorithm. The function starts by resetting the history variables and storing the input data in instance variables.

Then we have a chain of 'if-elif' branches, one for each algorithm the class supports. Depending on the algorithm we choose, the corresponding branch of the fit function performs the gradient descent. I will explain each of these implementations in detail later in the article.

    def append_log(self):
        self.w_h.append(self.w)
        self.b_h.append(self.b)
        self.e_h.append(self.error(self.X, self.Y))

Finally, we have the append_log function, which stores the parameter values and the loss value at each epoch for each variant of gradient descent.

Setting Up for Plotting

In this section, we will define some configuration parameters for simulating the gradient descent update rule on a simple toy data set. We also define some functions to create and animate 3D & 2D plots that visualize how the update rule works. This kind of setup helps us run experiments with different starting points and different hyperparameter settings, and plot or animate the update rule for different variants of gradient descent.
    #Data
    X = np.asarray([3.5, 0.35, 3.2, -2.0, 1.5, -0.5])
    Y = np.asarray([0.5, 0.50, 0.5, 0.5, 0.1, 0.3])
    #Algo and parameter values
    algo = 'GD'
    w_init = 2.1
    b_init = 4.0
    #parameter min and max values- to plot update rule
    w_min = -7
    w_max = 5
    b_min = -7
    b_max = 5
    #learning algorithm options
    epochs = 200
    mini_batch_size = 6
    gamma = 0.9
    eta = 5
    #animation number of frames
    animation_frames = 20
    #plotting options
    plot_2d = True
    plot_3d = False

First, we take a simple toy data set in which every point has one input and one output (six points in the listing above). The string variable algo selects which algorithm we want to execute. We initialize the parameters 'w' and 'b' through w_init and b_init to indicate where the algorithm starts.

With w_min, w_max, b_min and b_max we set the limits of the region of parameter space over which the error surface is plotted. Gradient descent itself is not constrained to this range; the numbers are chosen so that the plots illustrate the update rule well. Next, we set the values of the hyperparameters. Some of them are specific to certain algorithms, and I will discuss them along with the implementation of those algorithms. Finally, animation_frames, plot_2d and plot_3d are the variables required to animate or plot the update rule.

    sn = SN(w_init, b_init, algo)
    sn.fit(X, Y, epochs=epochs, eta=eta, gamma=gamma, mini_batch_size=mini_batch_size)
    plt.plot(sn.e_h, 'r')
    plt.plot(sn.w_h, 'b')
    plt.plot(sn.b_h, 'g')
    plt.legend(('error', 'weight', 'bias'))
    plt.title("Variation of Parameters and loss function")
    plt.xlabel("Epoch")
    plt.show()

Once we have set up our configuration parameters, we initialize our SN class and then call the fit method using those parameters. We also plot our three history variables to visualize how the parameters and the loss value vary across epochs.
3D & 2D Plotting Setup

    if plot_3d:
        W = np.linspace(w_min, w_max, 256)
        b = np.linspace(b_min, b_max, 256)
        WW, BB = np.meshgrid(W, b)
        Z = sn.error(X, Y, WW, BB)

        fig = plt.figure(dpi=100)
        # fig.gca(projection='3d') no longer works on recent Matplotlib releases
        ax = fig.add_subplot(projection='3d')
        surf = ax.plot_surface(WW, BB, Z, rstride=3, cstride=3, alpha=0.5,
                               cmap=cm.coolwarm, linewidth=0, antialiased=False)
        cset = ax.contourf(WW, BB, Z, 25, zdir='z', offset=-1, alpha=0.6, cmap=cm.coolwarm)
        ax.set_xlabel('w')
        ax.set_xlim(w_min - 1, w_max + 1)
        ax.set_ylabel('b')
        ax.set_ylim(b_min - 1, b_max + 1)
        ax.set_zlabel('error')
        ax.set_zlim(-1, np.max(Z))
        ax.view_init(elev=25, azim=-75)  # azim = -20
        ax.dist = 12
        title = ax.set_title('Epoch 0')

To create a 3D plot, we first create a mesh grid from 256 equally spaced values between the minimum and maximum values of 'w' and 'b' (the np.linspace and np.meshgrid calls). Using the mesh grid, we calculate the error Z at every grid point by calling the error function of our sigmoid class SN. We then create a figure and an axis handle for the 3D plot.

Next, we draw a surface plot of the error with respect to weight and bias using the ax.plot_surface function, specifying how often we want to sample the grid points by setting rstride and cstride. We then draw filled contours of the error with the ax.contourf function; with zdir='z' and offset=-1 the contours are projected onto the plane z = -1, beneath the surface. After that, we set the labels and the limits for all three axes. Because this is a 3D plot, we also need to define the viewpoint: ax.view_init and ax.dist place the camera at an elevation of 25 degrees, an azimuth of -75 degrees and a distance of 12 units.
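The call sn.error(X, Y, WW, BB) works on whole grids because NumPy broadcasts each scalar data point against the 2D arrays of weights and biases. The following standalone sketch shows that mechanism outside the class; the function name error_grid is hypothetical, and the two-point data set and the plotting limits are taken from the configuration values used in this article.

```python
import numpy as np

def error_grid(X, Y, WW, BB):
    """Summed half squared error evaluated at every (w, b) pair of the grid."""
    Z = np.zeros_like(WW, dtype=float)
    for x, y in zip(X, Y):
        # x and y are scalars; WW and BB are 2D, so y_pred is a 2D array
        y_pred = 1.0 / (1.0 + np.exp(-(WW * x + BB)))
        Z += 0.5 * (y_pred - y) ** 2
    return Z

X = np.asarray([0.5, 2.5])
Y = np.asarray([0.2, 0.9])
W = np.linspace(-7, 5, 256)
B = np.linspace(-7, 5, 256)
WW, BB = np.meshgrid(W, B)
Z = error_grid(X, Y, WW, BB)

# Grid point with the lowest error: a coarse estimate of the best (w, b)
i, j = np.unravel_index(np.argmin(Z), Z.shape)
print(Z.shape, WW[i, j], BB[i, j], Z[i, j])
```

The resulting array Z has one error value per grid cell, which is the form plot_surface and contourf expect.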
    def plot_animate_3d(i):
        i = int(i*(epochs/animation_frames))
        line1.set_data(sn.w_h[:i+1], sn.b_h[:i+1])
        line1.set_3d_properties(sn.e_h[:i+1])
        line2.set_data(sn.w_h[:i+1], sn.b_h[:i+1])
        line2.set_3d_properties(np.zeros(i+1) - 1)
        title.set_text('Epoch: {: d}, Error: {:.4f}'.format(i, sn.e_h[i]))
        return line1, line2, title

    if plot_3d:
        #animation plots of gradient descent
        i = 0
        line1, = ax.plot(sn.w_h[:i+1], sn.b_h[:i+1], sn.e_h[:i+1], color='black', marker='.')
        line2, = ax.plot(sn.w_h[:i+1], sn.b_h[:i+1], np.zeros(i+1) - 1, color='red', marker='.')
        anim = animation.FuncAnimation(fig, func=plot_animate_3d, frames=animation_frames)
        rc('animation', html='jshtml')
        anim

On top of our static 3D plot, we want to visualize what the algorithm is doing dynamically, which is captured by our history variables for the parameters and the error at every epoch. To create an animation of our gradient descent algorithm, we use the animation.FuncAnimation function, passing our custom function plot_animate_3d as one of the parameters and specifying the number of frames needed for the animation. The function plot_animate_3d extends the plotted path with the parameter and error values recorded up to the given epoch, for the respective values of 'w' and 'b'. In the same function, the title.set_text call sets the text showing the error value at that particular epoch. Finally, to display the animation in-line, we call the rc function to render the HTML content inside the Jupyter notebook.

Similar to the 3D plot, we can create a function to plot 2D contour plots.
    if plot_2d:
        W = np.linspace(w_min, w_max, 256)
        b = np.linspace(b_min, b_max, 256)
        WW, BB = np.meshgrid(W, b)
        Z = sn.error(X, Y, WW, BB)

        fig = plt.figure(dpi=100)
        ax = plt.subplot(111)
        ax.set_xlabel('w')
        ax.set_xlim(w_min - 1, w_max + 1)
        ax.set_ylabel('b')
        ax.set_ylim(b_min - 1, b_max + 1)
        title = ax.set_title('Epoch 0')
        cset = plt.contourf(WW, BB, Z, 25, alpha=0.8, cmap=cm.bwr)
        plt.savefig("temp.jpg", dpi=2000)
        plt.show()

    def plot_animate_2d(i):
        i = int(i*(epochs/animation_frames))
        line.set_data(sn.w_h[:i+1], sn.b_h[:i+1])
        title.set_text('Epoch: {: d}, Error: {:.4f}'.format(i, sn.e_h[i]))
        return line, title

    if plot_2d:
        i = 0
        line, = ax.plot(sn.w_h[:i+1], sn.b_h[:i+1], color='black', marker='.')
        anim = animation.FuncAnimation(fig, func=plot_animate_2d, frames=animation_frames)
        rc('animation', html='jshtml')
        anim

Algorithm Implementation

In this section, we will implement different variants of the gradient descent algorithm and generate 3D & 2D animation plots.

Vanilla Gradient Descent

The gradient descent algorithm updates the parameters by moving in the direction opposite to the gradient of the objective function with respect to the network parameters. The parameter update rule is given by:

[Figure: Gradient Descent Update Rule]

    for i in range(epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += self.grad_w(x, y)
            db += self.grad_b(x, y)
        self.w -= eta * dw / X.shape[0]
        self.b -= eta * db / X.shape[0]
        self.append_log()

In batch gradient descent, we iterate over all the training data points and compute the cumulative sum of the gradients for the parameters 'w' and 'b'. We then update the parameter values using the cumulative gradient, averaged over the number of data points, and the learning rate.

To execute the gradient descent algorithm, change the configuration settings as shown below.
    X = np.asarray([0.5, 2.5])
    Y = np.asarray([0.2, 0.9])

    algo = 'GD'

    w_init = -2
    b_init = -2

    w_min = -7
    w_max = 5

    b_min = -7
    b_max = 5

    epochs = 1000
    eta = 1

    animation_frames = 20

    plot_2d = True
    plot_3d = True

In the configuration settings, we set the variable algo to 'GD' to indicate that we want to execute the vanilla gradient descent algorithm in our sigmoid neuron to find the best parameter values. After we set up our configuration parameters, we execute the SN class 'fit' method to train the sigmoid neuron on the toy data.

[Figure: Gradient Descent History]

The above plot shows how the history values of error, weight, and bias vary across epochs while the algorithm is learning the best parameters. The important point to note from the graph is that during the initial epochs the error value hovers close to 0.5, but after 200 epochs the error value reaches almost zero.

If you want to plot the 3D or 2D animation, you can set the boolean variables plot_2d and plot_3d. The 3D plot shows what the error surface looks like for the corresponding values of 'w' and 'b'. The objective of the learning algorithm is to move towards the deep blue region, where the error/loss is at its minimum.

To visualize what the algorithm is doing dynamically, we can generate an animation using the function plot_animate_3d. As you play the animation, you can see the epoch number and the corresponding error value at that epoch.

If you want to slow down the animation, you can do so by clicking on the minus symbol in the video controls of the animation. Similarly, you can generate the animation for the 2D contour plot to see how the algorithm moves towards the global minimum.

Momentum-based Gradient Descent

In Momentum GD, we move with an exponentially decaying cumulative average of the previous gradients and the current gradient.
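In symbols, the two update rules can be written as follows. This is a sketch in the notation of the code in this article, where \eta is the learning rate (eta), \gamma is the momentum coefficient (gamma), v_t is the history variable, and \nabla w_t stands for the gradient of the loss with respect to w accumulated over the data at step t. The same equations hold for b, and v_0 = 0.

```latex
% Vanilla gradient descent
w_{t+1} = w_t - \eta \, \nabla w_t

% Momentum-based gradient descent
v_t = \gamma \, v_{t-1} + \eta \, \nabla w_t
w_{t+1} = w_t - v_t

% Unrolling the recursion shows the exponentially decaying sum of past gradients
v_t = \eta \sum_{k=0}^{t-1} \gamma^{k} \, \nabla w_{t-k}
```

The unrolled form makes the "exponentially decaying" part explicit: a gradient from k steps ago still contributes to the current step, but with weight \gamma^k.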
The code for Momentum GD is given below:

    v_w, v_b = 0, 0
    for i in range(epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += self.grad_w(x, y)
            db += self.grad_b(x, y)
        v_w = gamma * v_w + eta * dw
        v_b = gamma * v_b + eta * db
        self.w = self.w - v_w
        self.b = self.b - v_b
        self.append_log()

In Momentum-based GD, we include history variables to keep track of the previous gradients. The variable gamma denotes how much momentum we impart to the algorithm. The variables v_w and v_b combine the history with the current gradient to determine the size and direction of the step. At the end of each epoch, we call the append_log function to store the history of the parameters and the loss value.

To execute Momentum GD for our sigmoid neuron, you need to make a few modifications to the configuration settings, as shown below:

    X = np.asarray([0.5, 2.5])
    Y = np.asarray([0.2, 0.9])

    algo = 'Momentum'

    w_init = -2
    b_init = -2

    w_min = -7
    w_max = 5

    b_min = -7
    b_max = 5

    epochs = 1000
    mini_batch_size = 6
    gamma = 0.9
    eta = 1

    animation_frames = 20

    plot_2d = True
    plot_3d = True

The variable algo is set to 'Momentum' to indicate that we want to use Momentum GD to find the best parameters for our sigmoid neuron. Another important change is the gamma variable, which controls how much momentum we impart to the learning algorithm. The gamma value varies between 0 and 1. After we set up our configuration parameters, we execute the SN class 'fit' method to train the sigmoid neuron on the toy data.

[Figure: Variation for Momentum GD]

From the plot, we can see that there are a lot of oscillations in the values of the weight and bias terms. Because of the accumulated history, Momentum GD oscillates in and out of the minimum.
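The overshooting described above can be reproduced with a few hand-picked numbers. The sketch below isolates a single momentum update in a pure function (the name momentum_step and the gradient values are made up for this illustration; they are not outputs of the sigmoid neuron) and feeds it two positive gradients followed by a negative one.

```python
def momentum_step(w, v, grad, gamma=0.9, eta=1.0):
    """One momentum update: blend the old velocity with the new gradient."""
    v_new = gamma * v + eta * grad
    return w - v_new, v_new

w, v = 0.0, 0.0
history = []
# Hypothetical gradients: the sign flips on the third step
for grad in [2.0, 2.0, -1.0]:
    w, v = momentum_step(w, v, grad, gamma=0.9, eta=0.5)
    history.append((w, v))

for step, (w_t, v_t) in enumerate(history, start=1):
    print(step, round(w_t, 4), round(v_t, 4))
```

On the third step the gradient has changed sign, yet the velocity is still positive, so the parameter keeps moving in the old direction. This is the mechanism behind the oscillations around the minimum.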
Nesterov Accelerated Gradient Descent

In Nesterov Accelerated Gradient Descent (NAG), we look ahead to see whether we are close to the minimum before we take another step based on the current gradient value, so that we can avoid the problem of overshooting. The code for NAG is given below:

    v_w, v_b = 0, 0
    for i in range(epochs):
        dw, db = 0, 0
        v_w = gamma * v_w
        v_b = gamma * v_b
        for x, y in zip(X, Y):
            dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)
            db += self.grad_b(x, y, self.w - v_w, self.b - v_b)
        v_w = v_w + eta * dw
        v_b = v_b + eta * db
        self.w = self.w - v_w
        self.b = self.b - v_b
        self.append_log()

The main change in the code for NAG GD is the computation of v_w and v_b. In Momentum GD, we compute these variables in one step, but in NAG we do it in two steps.

    v_w = gamma * v_w
    v_b = gamma * v_b
    for x, y in zip(X, Y):
        dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)
        db += self.grad_b(x, y, self.w - v_w, self.b - v_b)
    v_w = v_w + eta * dw
    v_b = v_b + eta * db

In the first part, before we iterate through the data, we multiply the history variables by gamma. The gradient is then computed at the look-ahead point, obtained by subtracting the decayed history from self.w and self.b. To execute NAG GD, we just need to set the algo variable to 'NAG'. You can generate the 3D or 2D animations to see how NAG GD differs from Momentum GD in reaching the global minimum.

Mini-Batch and Stochastic Gradient Descent

Instead of looking at all the data points in one go, we divide the entire data into a number of subsets. For each subset, we compute the derivatives for each point in the subset and make an update to the parameters. Instead of calculating the derivative of the loss function over the entire data, we approximate it using fewer points, that is, a smaller batch size.
This method of computing gradients in batches is called Mini-Batch Gradient Descent. The code for Mini-Batch GD is given below:

    for i in range(epochs):
        dw, db = 0, 0
        points_seen = 0
        for x, y in zip(X, Y):
            dw += self.grad_w(x, y)
            db += self.grad_b(x, y)
            points_seen += 1
            if points_seen % mini_batch_size == 0:
                self.w -= eta * dw / mini_batch_size
                self.b -= eta * db / mini_batch_size
                self.append_log()
                dw, db = 0, 0

In Mini-Batch GD, we loop through the entire data and keep track of the number of points we have seen using the variable points_seen. Whenever the number of points seen is a multiple of the mini-batch size, we update the parameters of the sigmoid neuron. In the special case where the mini-batch size is equal to one, the algorithm becomes Stochastic Gradient Descent. To execute Mini-Batch GD, we just need to set the algo variable to 'MiniBatch'. You can generate the 3D or 2D animations to see how Mini-Batch GD differs from Momentum GD in reaching the global minimum.

AdaGrad Gradient Descent

The main motivation behind AdaGrad is the idea of an adaptive learning rate for different features in the dataset: instead of using the same learning rate across all the features, we might need a different learning rate for different features. The code for AdaGrad is given below:

    v_w, v_b = 0, 0
    for i in range(epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += self.grad_w(x, y)
            db += self.grad_b(x, y)
        v_w += dw**2
        v_b += db**2
        # eps belongs inside the denominator to guard against division by zero
        self.w -= (eta / (np.sqrt(v_w) + eps)) * dw
        self.b -= (eta / (np.sqrt(v_b) + eps)) * db
        self.append_log()

In AdaGrad, we maintain a running sum of squared gradients and then update the parameters by dividing the learning rate by the square root of this history (plus a small eps for numerical stability). Instead of a static learning rate, we get a dynamic learning rate that adapts to dense and sparse features. The mechanism to generate plots/animation remains the same as above.
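To see the adaptive learning rate at work, the sketch below applies an AdaGrad update to a single parameter with a constant, made-up gradient. The function adagrad_step is a hypothetical standalone helper written for this illustration; it places eps inside the denominator, which is the usual form of the rule.

```python
import numpy as np

def adagrad_step(w, v, grad, eta=0.1, eps=1e-8):
    """One AdaGrad update: accumulate grad^2 and scale the step by its root."""
    v_new = v + grad ** 2
    w_new = w - eta / (np.sqrt(v_new) + eps) * grad
    return w_new, v_new

w, v = 1.0, 0.0
steps = []
# Hypothetical constant gradient, as for a dense (frequently updated) feature
for grad in [4.0, 4.0, 4.0]:
    w_next, v = adagrad_step(w, v, grad, eta=0.1)
    steps.append(w - w_next)  # size of the step actually taken
    w = w_next

print([round(float(s), 4) for s in steps])
```

Even though the gradient stays the same, each step is smaller than the previous one because the accumulated history keeps growing. This is the behavior that RMSProp later relaxes.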
The idea here is to play with different toy datasets and different hyperparameter configurations.

RMSProp Gradient Descent

In RMSProp, the history of gradients is calculated using an exponentially decaying average, unlike the sum of squared gradients in AdaGrad, which helps to prevent the rapid growth of the denominator for dense features. The code for RMSProp is given below:

    v_w, v_b = 0, 0
    for i in range(epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += self.grad_w(x, y)
            db += self.grad_b(x, y)
        v_w = beta * v_w + (1 - beta) * dw**2
        v_b = beta * v_b + (1 - beta) * db**2
        # eps belongs inside the denominator to guard against division by zero
        self.w -= (eta / (np.sqrt(v_w) + eps)) * dw
        self.b -= (eta / (np.sqrt(v_b) + eps)) * db
        self.append_log()

The only change we need to make to the AdaGrad code is how we update the variables v_w and v_b. In AdaGrad, v_w and v_b always increase, accumulating the squared gradient of each parameter since the first epoch, whereas in RMSProp v_w and v_b are exponentially decaying weighted sums of squared gradients, controlled by a hyperparameter called 'beta'. To execute RMSProp GD, we just need to set the algo variable to 'RMSProp'. You can generate the 3D or 2D animations to see how RMSProp GD differs from AdaGrad GD in reaching the global minimum.

Adam Gradient Descent

Adam maintains two histories: 'mₜ', similar to the history used in Momentum GD, and 'vₜ', similar to the history used in RMSProp. In practice, Adam also does something known as bias correction. It uses the following equations for 'mₜ' and 'vₜ':

[Figure: Bias Correction]

Bias correction ensures that the updates at the beginning of training are not distorted by the zero-initialized histories. The key point in Adam is that it combines the advantages of Momentum GD (moving faster in gentle regions) and RMSProp GD (adjusting the learning rate).
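For reference, the standard Adam equations, including the bias correction, can be written as below. This is a sketch in the usual textbook notation rather than a reproduction of the original figure: \nabla w_t is the gradient at update t, \beta_1 and \beta_2 are the decay rates (beta1 and beta2 in the code), \eta is the learning rate, \epsilon is a small constant, and m_0 = v_0 = 0. The equations for b are identical.

```latex
% Momentum-like and RMSProp-like histories
m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, \nabla w_t
v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, (\nabla w_t)^2

% Bias correction
\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}
\qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}

% Parameter update
w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t
```

At t = 1 the corrections divide by (1 - \beta_1) and (1 - \beta_2), which exactly undoes the shrinkage caused by starting the histories at zero.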
The code for Adam GD is given below. Note that, unlike the previous variants, this implementation updates the parameters after every data point rather than once per epoch:

    v_w, v_b = 0, 0
    m_w, m_b = 0, 0
    num_updates = 0
    for i in range(epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw = self.grad_w(x, y)
            db = self.grad_b(x, y)
            num_updates += 1
            m_w = beta1 * m_w + (1-beta1) * dw
            m_b = beta1 * m_b + (1-beta1) * db
            v_w = beta2 * v_w + (1-beta2) * dw**2
            v_b = beta2 * v_b + (1-beta2) * db**2
            m_w_c = m_w / (1 - np.power(beta1, num_updates))
            m_b_c = m_b / (1 - np.power(beta1, num_updates))
            v_w_c = v_w / (1 - np.power(beta2, num_updates))
            v_b_c = v_b / (1 - np.power(beta2, num_updates))
            # eps belongs inside the denominator to guard against division by zero
            self.w -= (eta / (np.sqrt(v_w_c) + eps)) * m_w_c
            self.b -= (eta / (np.sqrt(v_b_c) + eps)) * m_b_c
        self.append_log()

In the Adam optimizer, we compute m_w & m_b to keep track of the momentum history, and v_w & v_b, which are used to decay the denominator and prevent its rapid growth, just like in RMSProp. After that, we apply the bias correction to the Momentum-based history variables and the RMSProp-based history variables. Once we have the bias-corrected histories, we use them to update the parameters 'w' and 'b'.

To execute the Adam gradient descent algorithm, change the configuration settings as shown below.

    X = np.asarray([3.5, 0.35, 3.2, -2.0, 1.5, -0.5])
    Y = np.asarray([0.5, 0.50, 0.5, 0.5, 0.1, 0.3])

    algo = 'Adam'

    w_init = -6
    b_init = 4.0

    w_min = -7
    w_max = 5

    b_min = -7
    b_max = 5

    epochs = 200
    gamma = 0.9
    eta = 0.5
    eps = 1e-8

    animation_frames = 20

    plot_2d = True
    plot_3d = False

The variable algo is set to 'Adam' to indicate that we want to use Adam GD to find the best parameters for our sigmoid neuron. Note that the Adam code shown above does not read the gamma variable: the amount of momentum is controlled by beta1, and the decay of the squared-gradient history by beta2. Both take values between 0 and 1 and default to 0.9 in the fit method. After we set up our configuration parameters, we execute the SN class 'fit' method to train the sigmoid neuron on the toy data.
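The effect of the bias correction is easiest to see on the very first update. The sketch below is an illustrative standalone version of one Adam step for a single parameter (the function name adam_step and the gradient value 0.01 are assumptions made for this example). It compares the corrected step with the step that the raw, uncorrected histories would have produced.

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.5, beta1=0.9, beta2=0.9, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based update count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected momentum history
    v_hat = v / (1 - beta2 ** t)  # bias-corrected squared-gradient history
    w = w - eta / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v

# First update from zero-initialized histories with a small hypothetical gradient
w1, m1, v1 = adam_step(w=0.0, m=0.0, v=0.0, grad=0.01, t=1)
corrected_step = 0.0 - w1
uncorrected_step = 0.5 * m1 / (np.sqrt(v1) + 1e-8)
print(corrected_step, uncorrected_step)
```

With the correction, the first step has a magnitude of roughly eta regardless of how small the gradient is; without it, the zero-initialized histories would shrink the step considerably.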
[Figure: Variation of Parameters in Adam GD]

We can also create the 2D contour plot animation, which shows the path Adam GD takes to the global minimum.

[Figure: Adam GD Animation]

Unlike in the case of RMSProp, we don't see many oscillations; instead, we move more deterministically towards the minimum, especially after the first few epochs.

This brings us to the end of our discussion on how to implement these optimization techniques using Numpy.

What's Next?

LEARN BY PRACTICING

In this article, we have used a toy data set with static initialization points. What you can do is take different initialization points and, for each of them, play with the different algorithms and see what kind of hyperparameter tuning is needed.

The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it. The best part is that you can run the code directly in Google Colab, so you don't need to worry about installing the packages.

https://github.com/Niranjankumar-c/GradientDescent_Implementation?source=post_page-----809e7ab3bab4----------------------

PS: If you are interested in converting the code into R, send me a message once it is done. I will feature your work here and also on the GitHub page.

Conclusion

In this post, we have seen how to implement different variants of the gradient descent algorithm using a simple sigmoid neuron. We have also seen how to create 3D or 2D animations for each of these variants that show how the learning algorithm finds the best parameters.

What's Next?

Backpropagation is the backbone of how neural networks learn what they learn. If you are interested in learning more about Neural Networks, check out the Artificial Neural Networks course by Abhishek and Pukhraj from Starttechacademy. This course will be taught using the latest version of Tensorflow 2.0 (Keras backend).
In my next post, we will discuss different activation functions such as logistic, ReLU, LeakyReLU, etc., and some of the best initialization techniques like Xavier and He initialization. So make sure you follow me on Medium to get notified as soon as it drops.

Until then, Peace :)

NK.

Niranjan Kumar is an intern at HSBC Analytics division. He is passionate about deep learning and AI. He is one of the top writers at Medium in Artificial Intelligence. Connect with me on LinkedIn or follow me on twitter for updates about upcoming articles on deep learning and Artificial Intelligence.

Currently, I am looking for opportunities, either full-time or freelance projects, in the field of Machine Learning and Deep Learning. Feel free to drop me a message either on LinkedIn or through email. I would love to discuss.