The previous three posts can be found here:
DL01: Neural Networks Theory
DL02: Writing a Neural Network from Scratch (Code)
DL03: Gradient Descent
So, welcome to part 4 of this series! This part requires a little bit of maths, so basic calculus is a prerequisite.
In this post, I’ll try to explain backprop with a very simple neural network, shown below:
L1, L2 and L3 represent the layers in the neural net. The numbers in square brackets represent the layer number. I’ve numbered every node on each layer, e.g. the second node of the first layer is numbered 2, and so on.
I’ve labelled every weight too, e.g. the weight connecting the second node of L2 (node 2) to the first node of L3 (node 1) is w21[2].
I’ll assume that we’re using activation function g(z).
The basic concept behind backpropagation is to calculate error derivatives. After the forward pass through the net, we calculate the error, and then update the weights through gradient descent (using the error derivatives calculated by backprop).
If you understand the chain rule in differentiation, you’ll easily understand backpropagation.
First, let us write the equations for the forward pass.
Let x1 and x2 be the inputs at L1. We’ll denote:
z as the weighted sum of the outputs of the previous layer, and
a as the output of a node after applying the non-linearity/activation function.
z1[2] = w11[1]*x1 + w21[1]*x2
a1[2] = g(z1[2])
z2[2] = w12[1]*x1 + w22[1]*x2
a2[2] = g(z2[2])
z3[2] = w13[1]*x1 + w23[1]*x2
a3[2] = g(z3[2])
z1[3] = w11[2]*a1[2] + w21[2]*a2[2] + w31[2]*a3[2]
a1[3] = g(z1[3])
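The forward pass above can be sketched in NumPy. The input values, weight values, and the choice of sigmoid for g are illustrative assumptions, not from the post:

```python
import numpy as np

def g(z):
    # Assumed activation: sigmoid (the post keeps g generic)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2])                 # inputs x1, x2 at L1 (made up)

# W1[i-1, j-1] holds wij[1]: weight from node i of L1 to node j of L2
W1 = np.array([[0.1, 0.4, -0.3],
               [0.2, -0.1, 0.5]])
# W2[i-1, 0] holds wi1[2]: weight from node i of L2 to the L3 node
W2 = np.array([[0.3], [-0.2], [0.4]])

z2 = x @ W1        # z1[2], z2[2], z3[2]
a2 = g(z2)         # a1[2], a2[2], a3[2]
z3 = a2 @ W2       # z1[3]
a3 = g(z3)         # a1[3], the network's output
print(a3)
```

Each `@` computes the weighted sums for a whole layer at once, so the six z/a equations above collapse into four lines.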
Now I’ll use MSE as the loss function.
E = (1/2)×(a1[3]−t1)², where t1 is the target label.
To use gradient descent to backpropagate through the net, we’ll have to calculate the derivative of this error w.r.t. every weight, and then perform weight updates.
We need to find dE/dwij[k], where wij[k] is the weight connecting node i of layer k to node j of layer k+1.
By chain rule, we have:
dE/dw11[2] = (dE/da1[3]) * (da1[3]/dz1[3]) * (dz1[3]/dw11[2])
Now, we can calculate the three terms on the RHS as follows:
dE/da1[3] = a1[3] − t1
da1[3]/dz1[3] = g'(z1[3])
dz1[3]/dw11[2] = a1[2]
Therefore, we have:
dE/dw11[2] = (a1[3] − t1) * g'(z1[3]) * a1[2]
Let δ1[3] = (a1[3] − t1) * g'(z1[3]).
Hence, we have:
dE/dw11[2] = δ1[3] * a1[2]
Here, we’ll call δ1[3] the error propagated by node 1 of layer 3.
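The δ for the output node and the resulting gradients for the L2→L3 weights look like this in NumPy. Sigmoid is assumed for g (so g'(z) = g(z)·(1 − g(z)) can be computed from the activation itself), and all numeric values are invented for illustration:

```python
import numpy as np

# Illustrative forward-pass values (assumed, not from the post)
a2 = np.array([0.6, 0.4, 0.7])   # a1[2], a2[2], a3[2]
a3 = np.array([0.62])            # a1[3] = g(z1[3])
t1 = 1.0                         # target label

# delta1[3] = (a1[3] - t1) * g'(z1[3]); for sigmoid, g'(z) = a * (1 - a)
delta3 = (a3 - t1) * a3 * (1.0 - a3)

# dE/dwi1[2] = delta1[3] * ai[2], one gradient per L2 node
dW2 = a2 * delta3[0]
print(dW2)
```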
Now, we want to go one layer back. Consider the weight w11[1]. By chain rule:
dE/dw11[1] = (dE/da1[3]) * (da1[3]/dz1[3]) * (dz1[3]/da1[2]) * (da1[2]/dz1[2]) * (dz1[2]/dw11[1])
On simplifying, we get:
dE/dw11[1] = (a1[3] − t1) * g'(z1[3]) * w11[2] * g'(z1[2]) * x1
Similarly, backpropagating through the other nodes in L2, we get:
dE/dw21[1] = (a1[3] − t1) * g'(z1[3]) * w11[2] * g'(z1[2]) * x2
dE/dw12[1] = (a1[3] − t1) * g'(z1[3]) * w21[2] * g'(z2[2]) * x1
dE/dw22[1] = (a1[3] − t1) * g'(z1[3]) * w21[2] * g'(z2[2]) * x2
dE/dw13[1] = (a1[3] − t1) * g'(z1[3]) * w31[2] * g'(z3[2]) * x1
dE/dw23[1] = (a1[3] − t1) * g'(z1[3]) * w31[2] * g'(z3[2]) * x2
In terms of δ, these can be written as:
δj[2] = δ1[3] * wj1[2] * g'(zj[2]) for j = 1, 2, 3
dE/dwij[1] = δj[2] * xi
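Under the same sigmoid assumption, the δ's at L2 and the gradients for the L1→L2 weights can be sketched as follows (all numeric values are invented for illustration):

```python
import numpy as np

x = np.array([0.5, -0.2])          # inputs x1, x2 (made up)
a2 = np.array([0.6, 0.4, 0.7])     # L2 activations (made up)
W2 = np.array([0.3, -0.2, 0.4])    # w11[2], w21[2], w31[2] (made up)
delta3 = -0.09                     # delta1[3] from the output node (made up)

# deltaj[2] = delta1[3] * wj1[2] * g'(zj[2]); for sigmoid, g'(z) = a * (1 - a)
delta2 = delta3 * W2 * a2 * (1.0 - a2)

# dE/dwij[1] = deltaj[2] * xi  ->  a 2x3 matrix of gradients
dW1 = np.outer(x, delta2)
print(dW1)
```

Note that `np.outer` pairs every input xi with every δj[2], so one call produces the gradient for all six first-layer weights.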
When we have all the error derivatives, the weights are updated as:
wij[k] ← wij[k] − η * dE/dwij[k]
where η is called the ‘learning rate’.
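Once every dE/dwij[k] is in hand, the update itself is one line per weight matrix. The value of η and the weight/gradient numbers below are placeholders:

```python
import numpy as np

eta = 0.1                                   # learning rate (assumed)
W1 = np.array([[0.1, 0.4, -0.3],
               [0.2, -0.1, 0.5]])           # current wij[1] (made up)
dW1 = np.array([[0.01, -0.02, 0.005],
                [0.03, 0.01, -0.01]])       # dE/dwij[1] from backprop (made up)

# Gradient-descent step: wij <- wij - eta * dE/dwij
W1 = W1 - eta * dW1
print(W1)
```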
So, this was backprop for you! It is completely understandable if your head is spinning right now.
To get a better grasp of this, you can try deriving it yourself.
You can even do it for some other error functions like cross-entropy loss (with softmax), or for a particular activation function like sigmoid, tanh, ReLU or Leaky ReLU.
I hope I was able to clear up the basics of backpropagation in this post. A lot of time and effort was put into it, so feedback would be appreciated!