We know that the weights of a neural network are usually initialized randomly, and that with this kind of initialization it can take a significant number of iterations to converge to the lowest loss and reach the ideal weight matrix. The problem is that naive random initialization is prone to the vanishing or exploding gradient problem.
One way to reduce this problem is to choose the random initialization carefully. Xavier’s random weight initialization, aka Xavier’s algorithm, factors the size of the network (the number of input and output neurons) into the equation and addresses these problems.
Xavier Glorot and Yoshua Bengio introduced this approach to choosing better random weights. It not only reduces the chance of running into gradient problems but also helps the network converge to a low error faster.
a) If you’re using the ReLU activation function in a deep net (I’m talking about the activation function of the hidden layers), then:
b) Likewise, if you’re using the tanh activation function:
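The formulas for these two cases are not reproduced in the text here, so the following is only a sketch under an assumption: it uses the commonly taught scaling rules, a standard deviation of sqrt(2 / ni) for ReLU and sqrt(1 / ni) for tanh, where ni is the number of input units of the layer. The helper name `init_weights` and the layer sizes are made up for illustration.

```python
import numpy as np

def init_weights(n_in, n_out, activation, rng=None):
    """Sample an (n_out, n_in) weight matrix scaled for the given activation.

    Assumed scaling rules (the article's own formulas are not shown in the text):
      relu -> std = sqrt(2 / n_in)
      tanh -> std = sqrt(1 / n_in)
    """
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_in)
    else:
        raise ValueError(f"unsupported activation: {activation!r}")
    # Zero-mean Gaussian samples, rescaled to the chosen standard deviation
    return rng.standard_normal((n_out, n_in)) * scale

# Hypothetical layer with 512 inputs and 256 outputs
rng = np.random.default_rng(0)
W_relu = init_weights(512, 256, "relu", rng)
W_tanh = init_weights(512, 256, "tanh", rng)
```

Both matrices have the same shape. Only the spread of the values changes, and the ReLU version is wider by a factor of sqrt(2).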
The only major difference in Xavier’s initialization is the no term: we also add the number of output units of that layer (no) to the number of input units (ni) in the denominator.
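# Python code is here
import numpy as np
ni, no = 256, 128  # illustrative layer sizes (assumed): input and output units
# Zero-mean Gaussian weights with variance 2 / (ni + no), as in the Glorot and Bengio paper
W = np.random.randn(ni, no) * np.sqrt(2 / (ni + no))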
This sort of initialization helps keep the weights from being either too large or too small. As a result, the signal passing through the layers, and the gradients flowing back, are less likely to explode or vanish.
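To make this concrete, here is a small experiment, offered as a rough sketch rather than a benchmark. It pushes random inputs through a stack of tanh layers and reports the standard deviation of the final activations for three weight scales: a fixed small one, a fixed large one, and a Xavier-style one using the variance 2 / (ni + no) from the Glorot and Bengio paper. The function name `last_layer_std`, the layer count, and the widths are all assumptions chosen for illustration, and the activation scale is used only as a proxy for what happens to the gradients.

```python
import numpy as np

def last_layer_std(scale_fn, n_layers=10, width=256, n_samples=512, seed=0):
    """Forward random inputs through n_layers tanh layers and return
    the standard deviation of the final activations."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, width))
    for _ in range(n_layers):
        # Each layer draws fresh Gaussian weights with the chosen scale
        W = rng.standard_normal((width, width)) * scale_fn(width, width)
        a = np.tanh(a @ W)
    return float(a.std())

too_small = last_layer_std(lambda ni, no: 0.01)                     # fixed tiny scale
too_large = last_layer_std(lambda ni, no: 1.0)                      # fixed large scale
xavier = last_layer_std(lambda ni, no: np.sqrt(2.0 / (ni + no)))    # Xavier-style scale

print(f"too small: {too_small:.2e}")
print(f"too large: {too_large:.2f}")
print(f"xavier:    {xavier:.2f}")
```

With the tiny scale the activations collapse toward zero after a few layers, and with the large scale nearly every tanh unit saturates close to -1 or 1. The Xavier-style scale keeps the activations in a usable middle range.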
I learnt this from Coursera’s awesome Deep Learning Specialization by deeplearning.ai, in the course:
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Here is the original paper:
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio; PMLR 9:249–256, 2010