
How to Initialize weights in a neural net so it performs well?

by Rakshith Vasudev, May 12th, 2018

(Image source: http://www.mdpi.com/1099-4300/19/3/101)

We know that in a neural network, weights are usually initialized randomly, and that kind of initialization takes a significant number of iterations to converge to the minimum loss and reach the ideal weight matrix. The problem is that this kind of initialization is prone to vanishing or exploding gradient problems.

One way to reduce this problem is to choose the random weight initialization carefully. Xavier’s random weight initialization, aka Xavier’s algorithm, factors the size of the network (the number of input and output neurons) into the equation and addresses these problems.

Xavier Glorot and Yoshua Bengio proposed this concept of initializing better random weights. It not only reduces the chances of running into gradient problems but also helps converge to the minimum error faster.

General ways to initialize better weights:

a) If you’re using the ReLU activation function in deep nets (I’m talking about the hidden layers’ activation function), then:

  1. Generate a random sample of weights from a Gaussian distribution with mean 0 and a standard deviation of 1.
  2. Multiply that sample by the square root of (2/ni), where ni is the number of input units for that layer. (A minimal NumPy sketch follows this list.)
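
Here is a minimal sketch of those two steps for a single ReLU layer in NumPy; the layer sizes n_in and n_out are made-up values for illustration:

# ReLU ("He") case: Gaussian sample scaled by sqrt(2/ni)
import numpy as np

n_in, n_out = 256, 128                 # hypothetical number of input / output units
W = np.random.randn(n_out, n_in)       # step 1: Gaussian sample, mean 0, std 1
W = W * np.sqrt(2.0 / n_in)            # step 2: scale by sqrt(2/ni)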

b) Likewise, if you’re using the Tanh activation function:

  1. Generate a random sample of weights from a Gaussian distribution with mean 0 and a standard deviation of 1.
  2. Multiply that sample by the square root of (1/ni), where ni is the number of input units for that layer. (The sketch below shows the only line that changes.)
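
The sketch is the same as the ReLU case above; only the scaling factor changes:

# Tanh case: Gaussian sample scaled by sqrt(1/ni) instead of sqrt(2/ni)
import numpy as np

n_in, n_out = 256, 128                 # hypothetical number of input / output units
W = np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)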

So what is this Xavier’s initialization?

The only major difference in Xavier’s initialization is the extra output term no: the number of output units for that layer is added to ni in the scaling factor.

For Tanh:

  1. Generate a random sample of weights from a Gaussian distribution with mean 0 and a standard deviation of 1.
  2. Multiply that sample by the square root of (1/(ni+no)), where ni is the number of input units and no is the number of output units for that layer.

# python code is here

import numpy as np

ni, no = 256, 128                                     # number of input / output units for this layer (example values)
W = np.random.randn(no, ni) * np.sqrt(1 / (ni + no))  # Gaussian sample (mean 0, std 1) scaled by sqrt(1/(ni+no))

Why does this initialization help prevent gradient problems?

This sort of initialization helps keep the weights neither much bigger than 1 nor much smaller than 1, so the gradients neither explode nor vanish as the signal moves through the layers.
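
A quick way to see the effect (a toy sketch of my own, not from the course or the paper, with made-up layer sizes): push random data through a deep stack of ReLU layers and compare how large the activations get when the Gaussian weights are left unscaled, scaled too small, or scaled by sqrt(2/ni) as described above.

import numpy as np

def activation_std(scale, n_layers=20, n_units=256):
    # Push random inputs through n_layers ReLU layers whose weights are
    # standard-Gaussian samples multiplied by `scale`, then report the
    # standard deviation of the final activations.
    rng = np.random.default_rng(0)
    a = rng.standard_normal((1000, n_units))
    for _ in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * scale
        a = np.maximum(0.0, a @ W)
    return a.std()

print(activation_std(1.0))                 # unscaled: activations blow up to extremely large values
print(activation_std(0.01))                # scaled too small: activations shrink towards zero
print(activation_std(np.sqrt(2.0 / 256)))  # sqrt(2/ni) scaling: activations stay on the order of 1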

I learnt this from Coursera’s awesome Deep Learning Specialization by deeplearning.ai:

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization:

https://www.coursera.org/learn/deep-neural-network/

Here is the original paper:

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot, Yoshua Bengio; PMLR 9:249–256

If you liked this article, then clap it up! :) Maybe a follow?

Connect with me on Social:


Rakshith Vasudev | LinkedIn: www.linkedin.com


Rakshith Vasudev | Facebook: www.facebook.com


Rakshith Vasudev | YouTube: www.youtube.com