Rules of thumb for Deep Learning

by A Naveen Kumar, August 1st, 2018

With a few years of experience training and using the various open-source model networks that are available, I have learned the hard way how to set the various hyperparameters and use them efficiently. I have lost track of the sources from which I collected this information, but it mostly seems to work for me. So, today, I would like to share the rules I remember.

None of this might work for you out of the box, but it has worked for me most of the time. Most of these observations aren't my own intuitions; they were learned from the various resources I have gone through.

Embedding Dimension

If you are trying to create your own vectors for words (or anything else), the embedding dimension is slightly trickier to settle on, and it is confusing to know what might work and what might not. This rough formula has worked for me in most scenarios.

the embedding dimension to use = 4th root of the vocabulary size (the dictionary of your content)
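
For example, a minimal sketch of this rule in Python (the function name is my own, just for illustration):

    def embedding_dim(vocab_size):
        # Rule of thumb: embedding dimension ~= 4th root of the vocabulary size.
        return max(1, round(vocab_size ** 0.25))

    print(embedding_dim(50000))  # a 50,000-word vocabulary -> 15 dimensions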

Output Feature Map Resolution

In an image classification scenario, if you are planning to build your own network, my observation across various networks is that the resolution of your final feature map should be:

the spatial resolution of the final feature map = 1/32nd of the original image resolution (image classification)

The same observation for semantic segmentation gives a slightly different variation of the formula:

the spatial resolution of the encoder output feature map = 1/16th of the original image resolution (semantic segmentation)
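
As a quick sanity check, here is what the two rules imply for a 224x224 input, assuming the usual power-of-two downsampling:

    image_size = 224

    # Image classification: final feature map at 1/32nd of the input resolution.
    print(image_size // 32)  # 7  -> a 7x7 feature map

    # Semantic segmentation: encoder output at 1/16th of the input resolution.
    print(image_size // 16)  # 14 -> a 14x14 feature map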

The Learning Rate

The biggest talking point. I don't have a solid formula for the learning rate, but the rough idea is to start high and reduce it as you go further. If you are starting from your own randomly initialised weights, make it high and move it around; if the weights are initialised from a pre-trained network, keep it slightly low and keep reducing it according to your validation score or one of the various formulae around it.

The one rule that always holds: reduce the learning rate as your training progresses.
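
A minimal sketch of that rule as a step-decay schedule (the function and its default values are my own illustration, not from any particular library):

    def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every `epochs_per_drop` epochs.
        return initial_lr * (drop ** (epoch // epochs_per_drop))

    for epoch in (0, 10, 20, 30):
        print(epoch, step_decay_lr(0.1, epoch))  # 0.1, 0.05, 0.025, 0.0125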

Weight Initialisation

Always initialise with pre-trained network weights (ImageNet or anything else). The problems that come with randomly initialised weights are worse than most people can handle from a resources point of view.
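
For example, with torchvision (using the `pretrained=True` API that was current when this was written; the 10 output classes are just a placeholder for your own task):

    import torch.nn as nn
    import torchvision.models as models

    # Start from ImageNet weights instead of a random initialisation.
    model = models.resnet50(pretrained=True)

    # Swap the classifier head for your own task (here, 10 classes).
    model.fc = nn.Linear(model.fc.in_features, 10)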

LSTM Forget Gate

LSTMs: initialise the forget-gate biases to higher values; otherwise the gate just acts as a sigmoid of your input (when the initialised weights end up being small). Most of the major libraries do this by default now.
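
A sketch of that initialisation in PyTorch, which packs an LSTM's gate parameters in input, forget, cell, output order:

    import torch.nn as nn

    lstm = nn.LSTM(input_size=128, hidden_size=256)

    # Set the forget-gate slice of every bias vector to 1
    # so the cell remembers by default early in training.
    h = lstm.hidden_size
    for name, param in lstm.named_parameters():
        if "bias" in name:
            param.data[h:2 * h].fill_(1.0)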

Dropout

A hard yes (it should always be used).
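
For example, dropout between fully connected layers in PyTorch (p = 0.5 is the classic default; the layer sizes are placeholders):

    import torch.nn as nn

    classifier = nn.Sequential(
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),  # randomly zeroes half the activations during training
        nn.Linear(256, 10),
    )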

Batch Normalisation

Yes! (The paper attributes the benefit to reducing internal covariate shift; I say it simply makes the training phase better and faster.)
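
The usual placement is convolution, then batch normalisation, then the non-linearity; a minimal PyTorch sketch:

    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # BN supplies the bias
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )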

Data Preprocessing

Zero mean and unit variance (most of the pre-trained networks you see use this, and it works).
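
A minimal sketch of per-channel zero-mean, unit-variance normalisation with NumPy, computing the statistics from your own training images:

    import numpy as np

    def normalize(images):
        # images: float array of shape (N, H, W, C)
        mean = images.mean(axis=(0, 1, 2))       # per-channel mean
        std = images.std(axis=(0, 1, 2)) + 1e-7  # per-channel std (avoid divide-by-zero)
        return (images - mean) / std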

Mini-Batch Size

32–128 works best (a higher batch size than this might not yield better results most of the time). Even a lower batch size is good enough, but it might take longer to converge.

An Ensemble Of Models

Yes! An ensemble is usually better than a single model. Make this decision based on the complexity of the problem.
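
A minimal sketch of the simplest kind of ensemble, averaging class probabilities across models; it assumes each model is a callable that returns a probability vector, which is a hypothetical interface just for illustration:

    import numpy as np

    def ensemble_predict(models, x):
        # Average the predicted class probabilities of several models.
        probs = np.stack([model(x) for model in models])
        return probs.mean(axis=0)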

These are the few things I can think of as of now. I will keep updating this as and when I learn more. And I say again: these are just a few techniques that worked for me; they might or might not work for you. But this has been the pattern for most of the successful pre-trained networks that came before.