The previous two posts can be found here:
DL01: Neural Networks Theory
DL02: Writing a Neural Network from Scratch (Code)

To understand gradient descent, let's consider linear regression. Linear regression is a technique where, given some data points, we try to fit a line through those points and then make predictions by extrapolating that line. The challenge is to find the best fit for the line. For the sake of simplicity, we'll assume that the output (y) depends on just one input variable (x), i.e. the data points given to us are of the form (x, y).

Now we want to fit a line through these data points. The line will be of the form y = mx + b, and we want to find the best values for m and b. This line is called our "hypothesis".

To get started, m and b are randomly initialized. Hence, we get a random line in the beginning. Our goal is to update these values so that the resulting line gives the least error. We'll use mean squared error as the cost function (J), which measures the error between the actual output and the prediction from the hypothesis:

J(m, b) = (1/n) * ∑ (yᵢ - (mxᵢ + b))²

where n represents the total number of data points and i refers to the iᵗʰ data point (the sum runs over i from 1 to n).

This error function is convex. In simple terms, we can say that it typically has just one minimum (the global minimum).

Figure: A convex function

The horizontal axes represent the variables on which the function depends, and the vertical dimension represents the output. In our case, the horizontal axes are represented by m and b, while the vertical axis is represented by the error J.

When we start with random values of m and b, we get some value of J correspondingly. The error is minimum at the lowest point on this graph, so our goal is to move down the slope to finally reach the bottom-most point. The question is, how?

Before moving on, I'd like you to recall some of your high school maths.
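To make the cost function concrete before turning to the calculus, here is a minimal Python sketch of the hypothesis and the mean squared error. The function names `predict` and `mse_cost` and the sample data are assumptions made for illustration; they are not from the original post.

```python
def predict(m, b, x):
    """Hypothesis: the line y = m*x + b evaluated at x."""
    return m * x + b


def mse_cost(m, b, points):
    """Mean squared error J(m, b) over a list of (x, y) pairs."""
    n = len(points)
    return sum((y - predict(m, b, x)) ** 2 for x, y in points) / n


# Hypothetical data lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

print(mse_cost(2.0, 1.0, points))  # the line that generated the data
print(mse_cost(0.0, 0.0, points))  # a poor guess gives a larger error
```

The line that generated the data gives zero error, while any other choice of m and b gives a positive value of J. Gradient descent is a way of walking from a poor guess towards the values with the lowest error.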
The slope of the tangent at any point on a graph is equal to the derivative of the graph w.r.t. the input variables.

Source: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/

Now, as you can see, the slope of the tangent at the bottom-most point on the graph is 0, i.e. the partial derivatives of J at the bottom-most point are 0. To get to the bottom-most point, we have to move downhill, in the direction opposite to the slope, i.e. we'll update the values of m and b such that we eventually get to the optimum values, where the error function is minimum.

The update equations would be as follows:

m := m - α * ∂J/∂m
b := b - α * ∂J/∂b

Here, α is called the "learning rate", and it decides how large a step to take at each update. If the value of α is very small, reaching the optimum value is guaranteed for a convex function like this one, but it'll take a lot of time to converge. If α is very large, then the values of m and b might overshoot the optimal values, and then the error will start to increase instead of decreasing. Hence, the learning rate plays an important part in convex optimization.

A visual representation of gradient descent in action is shown below:

Source: https://alykhantejani.github.io/images/gradient_descent_line_graph.gif

How does it work intuitively?

For the sake of simplicity, consider an error function J that depends on only one variable (or weight) w.

Consider the situation shown in the figure, i.e. the initial weight is more than the optimal value. Here, initially the slope is large and positive. So, in the update equation, w is reduced. As w keeps getting reduced, notice that the gradient also reduces, and hence the updates become smaller and smaller, and eventually w converges to the minimum.

Now consider the situation when the initial weight is less than the optimal value. Here, the slope is negative. So, in the update equation, the value of w increases after every update. Initially, the slope is large in magnitude, so the updates are large.
As w increases, the magnitude of the slope keeps getting smaller, and hence the updates keep getting smaller, and eventually w converges to the minimum.

So, in any case, the weights will finally converge to a value such that the derivative of the cost function is 0 (given that the value of α is sufficiently small).

In this post, I covered gradient descent for linear regression. However, this is the basic idea behind neural networks too. All the weights in a neural network are updated according to this algorithm. Gradient descent is used for optimization in the backpropagation algorithm, which is covered in the next post.

Sometimes the smallest step in the right direction ends up being the biggest step of your life.
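For readers who want to see the whole procedure in one place, here is a minimal Python sketch of gradient descent for the line y = mx + b with the mean squared error as the cost. The function names, the sample data, the starting values m = b = 0 (instead of random initialization, so the run is reproducible), and the particular values of `alpha` and `steps` are all assumptions chosen for illustration.

```python
def gradients(m, b, points):
    """Partial derivatives of J(m, b) = (1/n) * sum((y - (m*x + b))**2)."""
    n = len(points)
    dm = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
    db = sum(-2 * (y - (m * x + b)) for x, y in points) / n
    return dm, db


def gradient_descent(points, alpha=0.05, steps=2000, m=0.0, b=0.0):
    """Repeatedly apply m := m - alpha*dJ/dm and b := b - alpha*dJ/db."""
    for _ in range(steps):
        dm, db = gradients(m, b, points)
        m, b = m - alpha * dm, b - alpha * db
    return m, b


# Hypothetical data lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

m, b = gradient_descent(points)
print(round(m, 3), round(b, 3))
```

On this data the result approaches m = 2 and b = 1, the line that generated the points. Rerunning the same sketch with a much larger learning rate, for example `alpha=0.5`, makes the values overshoot and grow at every step instead of converging, which is the behaviour described above for a learning rate that is too large.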

DL03: Gradient Descent
