Ordinary neural networks don't perform well in cases where the sequence of the data is important, for example language translation, sentiment analysis, time series and more. To overcome this failure, RNNs were invented. RNN stands for "Recurrent Neural Network". An RNN cell considers not only its present input but also the output of the RNN cells preceding it when producing its present output.

The simplest form of a vanilla RNN's present state can be represented as:

h_t = tanh(W_hh · h_(t-1) + W_xh · x_t)

*Representation of a simple RNN cell, source: Stanford*

RNNs perform very well on sequential data and on tasks where order matters. But ordinary RNNs come with several problems.

**Vanishing gradients problem**

*(1) tanh, (2) derivative of tanh*

The hyperbolic tangent (tanh) is most commonly used as the activation function in RNNs; its output lies in [-1, 1] and its derivative lies in [0, 1]. During backpropagation, the gradient is calculated by the chain rule, which has the effect of multiplying these small numbers together n times (where n is the number of times tanh is applied in the RNN architecture). This squeezes the final gradient to almost zero, so subtracting the gradient from the weights barely changes them, and training of the model stalls.

**Exploding gradients problem**

Opposite to the vanishing gradient problem: while following the chain rule we also multiply by the (transposed) weight matrix W at each step, and if its values are larger than 1, multiplying a large number by itself many times produces a very large number and the gradient explodes.

*Exploding and vanishing gradients, source: Stanford CS231n*

**Long-term dependencies problem**

*Long-term dependency problem; each node represents an RNN cell. Source: Google*

RNNs are good at handling sequential data, but they run into problems when the relevant context is far away. Example:

"I live in France and I know ____."

The answer must be "French" here, but if there are many more words in between "France" and the blank, it becomes difficult for an RNN to predict "French". This is the problem of long-term dependencies. Hence we come to LSTMs.

**Long Short Term Memory Networks**

LSTMs are a special kind of RNN with the capability of handling long-term dependencies. LSTMs also provide a solution to the vanishing/exploding gradient problem, which we'll discuss later in this article. A simple LSTM cell looks like this:

*RNN vs LSTM cell representation, source: Stanford*

To start, we need to initialize the weight matrices and bias terms, as shown in the code sketch below.

**Some information about an LSTM cell**

A simple LSTM cell consists of 4 gates: the forget gate (f), the input gate (i), the gate gate (g, the candidate values) and the output gate (o).

*Three LSTM cells connected to each other, source: Google*

*LSTM cell visual representation, source: Google*

*Handy information about the gates, source: Stanford CS231n*

Let's discuss the gates:

• **Forget gate**: After getting the previous state h(t-1), the forget gate helps us decide what must be removed from the h(t-1) state, keeping only the relevant stuff. It is wrapped in a sigmoid function, which crushes the input to [0, 1]. It is represented as:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)

*Forget gate, source: Google*

We multiply the forget gate with the previous cell state to forget the unnecessary stuff from the previous state that is not needed anymore.

• **Input gate**: In the input gate, we decide to add new stuff from the present input to our present cell state, scaled by how much we wish to add it.

*Input gate + gate gate, photo credits: Christopher Olah*

i_t = σ(W_i · [h_(t-1), x_t] + b_i)
g_t = tanh(W_g · [h_(t-1), x_t] + b_g)

The code is shown below.
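The article's original code lives behind the link at the end of this post, so what follows is only a minimal NumPy sketch of the weight/bias initialization and of the forget, input and gate gates. The sizes `input_dim` and `hidden_dim`, the small random initialization scale, the positive forget-gate bias value, and the helper names `sigmoid` and `gates` are illustrative assumptions, not the author's actual code.

```python
import numpy as np

def sigmoid(z):
    # Squashes values to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes, for illustration only
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)

# Each weight matrix acts on the concatenation [h_(t-1), x_t]; biases are vectors.
# The forget-gate bias is initialised to a positive value so that f starts
# close to 1, as discussed later in the article.
W_f = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
b_f = np.ones(hidden_dim)          # positive init for the forget gate
W_i = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
b_i = np.zeros(hidden_dim)
W_g = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
b_g = np.zeros(hidden_dim)

def gates(x_t, h_prev):
    """Compute the forget gate, input gate and candidate (gate gate) values."""
    hx = np.concatenate([h_prev, x_t])        # [h_(t-1), x_t]
    f_t = sigmoid(W_f @ hx + b_f)             # what to drop from C_(t-1), in [0, 1]
    i_t = sigmoid(W_i @ hx + b_i)             # how much new stuff to let in, in [0, 1]
    g_t = np.tanh(W_g @ hx + b_g)             # candidate values, in (-1, 1)
    return f_t, i_t, g_t
```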
In the input gate, the sigmoid layer decides which values will be updated, while the tanh layer creates a vector of new candidate values to be added to the present cell state.

To calculate the present cell state, we add the output of (input_gate * gate_gate) to the output of (forget_gate * previous cell state):

C_t = f_t * C_(t-1) + i_t * g_t

Finally, we'll decide what to output from our cell state, which is done by a sigmoid function.

• **Output gate**: We pass the cell state through tanh to crush the values between (-1, 1) and then multiply it by the output of the sigmoid function, so that we only output what we want to:

o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t)

*Output gate, source: Google*

*An overall view of what we did.*

**How LSTMs respond to the vanishing and exploding gradient problem**

LSTM has a much cleaner backprop compared to a vanilla RNN.

*Gradient flows smoothly during backprop, source: Stanford CS231n*

• First, there is no multiplication with the matrix W along the cell-state path during backprop; it is only an element-wise multiplication with f (the forget gate), so it is also cheaper to compute.

• Second, during backprop through each LSTM cell, the gradient is multiplied by a different value of the forget gate at each step, which makes it less prone to vanishing/exploding gradients. If the values of all the forget gates are less than 1, it may still suffer from vanishing gradients, but in practice people tend to initialise the forget-gate bias terms with some positive number, so at the beginning of training f is very close to 1, and as training progresses the model can learn these bias terms. The model may still suffer from the vanishing gradient problem, but the chances are much smaller.

This article was limited to the architecture of the LSTM cell, but you can see the complete code HERE. That code also implements an example of generating a simple sequence from random inputs using LSTMs (a minimal sketch of such a forward pass is included as an appendix at the very end of this post).

I tried the program using Deep Learning Studio. Deep Learning Studio comes with built-in Jupyter notebooks and pre-installed deep learning frameworks such as TensorFlow, Caffe, etc. So you just need to click on Notebooks (in the left pane) to open a Jupyter notebook in Deep Learning Studio and you're ready to go!

A special thanks to Christopher Olah and the Stanford CS231n team.

If you liked the article, do share and clap 😄. For more articles about Deep Learning, follow me on Medium and LinkedIn.

Thanks for reading. Happy LSTMs.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

More learning stuff and references:

Understanding LSTM Networks -- colah's blog (colah.github.io): "These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that…"
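**Appendix: a minimal forward-pass sketch**

This is not the linked code, just a small self-contained NumPy sketch that puts the gates, the cell-state update and the output gate together and runs one LSTM cell over a random input sequence. The sizes, the random seed and the helper names (`sigmoid`, `lstm_step`) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    # Squashes values to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 4, 8                         # illustrative sizes
rng = np.random.default_rng(0)

# One weight matrix and bias per gate: forget, input, gate (candidate), output.
W = {k: rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
     for k in "figo"}
b = {k: np.zeros(hidden_dim) for k in "figo"}
b["f"] += 1.0                                        # positive forget-gate bias init

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM cell step: returns the new hidden state and cell state."""
    hx = np.concatenate([h_prev, x_t])               # [h_(t-1), x_t]
    f = sigmoid(W["f"] @ hx + b["f"])                # forget gate
    i = sigmoid(W["i"] @ hx + b["i"])                # input gate
    g = np.tanh(W["g"] @ hx + b["g"])                # candidate values
    o = sigmoid(W["o"] @ hx + b["o"])                # output gate
    C_t = f * C_prev + i * g                         # cell-state update
    h_t = o * np.tanh(C_t)                           # new hidden state
    return h_t, C_t

# Run the cell over a random input sequence of 10 timesteps.
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.standard_normal((10, input_dim)):
    h, C = lstm_step(x_t, h, C)
print(h.shape, C.shape)                              # (8,) (8,)
```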