I hope you already know the basics of Recurrent Neural Networks (RNNs). If not, feel free to refer to this article. In Natural Language Processing (NLP), RNNs have played a major role in sequence modeling. Let’s see what RNNs can do and where their weak points lie. Rather than jumping straight into the topic, let’s move slowly!
In an RNN, the hidden state computed at the previous timestep is fed back in as an input at the next timestep. This is why RNNs work well with sequential information such as sentences.
For clarity, look at the example below:
Consider predicting the next word in a sentence from the previous words. The image below depicts how the RNN behaves at each timestep. The entire hidden state is depicted as a yellow rectangle.
You can see that, by the last timestep, the hidden state contains a summary of the sentence.
This is how the neurons are connected and what the RNN looks like.
But each timestep has its own loss. (You can sum them up to get a single loss value.)
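To make the idea of a hidden state that carries forward more concrete, here is a minimal sketch of a plain RNN forward pass in pure Python. The vocabulary, the one-hot encoding, the hidden size of 3 and the random weights are all assumptions made for illustration; they are not taken from the image.

```python
import math
import random

def rnn_forward(inputs, w_xh, w_hh, hidden_size):
    """Run a plain RNN over a sequence and return the hidden state of every timestep."""
    h = [0.0] * hidden_size              # initial hidden state: nothing remembered yet
    states = []
    for x in inputs:                     # one iteration per timestep
        new_h = []
        for i in range(hidden_size):
            # Mix the current input with the previous hidden state.
            total = sum(w_xh[i][j] * x[j] for j in range(len(x)))
            total += sum(w_hh[i][k] * h[k] for k in range(hidden_size))
            new_h.append(math.tanh(total))
        h = new_h                        # this state is fed into the next timestep
        states.append(h)
    return states

# Hypothetical toy setup: a 4-word vocabulary with one-hot word vectors.
random.seed(0)
vocab = ["Anne", "bought", "a", "car"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
hidden_size = 3
w_xh = [[random.uniform(-1, 1) for _ in vocab] for _ in range(hidden_size)]
w_hh = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]

sentence = ["Anne", "bought", "a"]
states = rnn_forward([one_hot[w] for w in sentence], w_xh, w_hh, hidden_size)
for word, h in zip(sentence, states):
    print(word, [round(v, 3) for v in h])
```

Notice that the hidden state keeps the same size at every timestep, no matter how many words have been read so far.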
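As a small sketch of how the individual losses can be combined, the snippet below assumes a cross-entropy loss at each timestep and simply adds them up. The probabilities are made-up numbers for illustration only, and `total_loss` is a hypothetical helper name.

```python
import math

def total_loss(predicted_probs):
    """Sum the per-timestep cross-entropy losses into one value.

    predicted_probs[t] is the probability the model assigned to the
    correct next word at timestep t.
    """
    step_losses = [-math.log(p) for p in predicted_probs]
    return step_losses, sum(step_losses)

# Made-up probabilities for five timesteps (illustration only).
step_losses, loss = total_loss([0.2, 0.5, 0.1, 0.6, 0.9])
print([round(l, 3) for l in step_losses])   # one loss per timestep
print(round(loss, 3))                       # single combined loss
```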
Now consider backpropagation. In an RNN, gradients flow back not only through the weight matrix but also through an additional dimension: time. Timestep 5 propagates its gradient in the usual way. At timestep 4, we also have to consider the gradient coming from timestep 5. At timestep 3, we have to consider the gradients coming from timesteps 4 and 5, and so on.
In this way, when backpropagating through an RNN, the gradient at timestep r has to include the gradients propagated back from every later timestep, from the last timestep down to timestep r+1. This is known as backpropagation through time.
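Here is a rough sketch of this backward pass for the simplest possible case: an RNN with a single hidden unit, h_t = tanh(w * h_(t-1) + u * x_t), and a squared-error loss at every timestep. The scalar setup, the loss choice and all the numbers are assumptions chosen to keep the example short. The result is checked against a numerical gradient.

```python
import math

def forward(w, u, xs):
    """Return [h_0, h_1, ..., h_T] for a one-unit RNN, with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, ys):
    """Sum of the per-timestep squared-error losses."""
    hs = forward(w, u, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt_grad_w(w, u, xs, ys):
    """Gradient of the summed loss with respect to the recurrent weight w."""
    hs = forward(w, u, xs)
    grad_w = 0.0
    from_future = 0.0                      # gradient arriving from later timesteps
    for t in range(len(xs), 0, -1):        # walk backwards: T, T-1, ..., 1
        dh = (hs[t] - ys[t - 1]) + from_future   # own loss + everything after it
        da = dh * (1.0 - hs[t] ** 2)             # back through tanh
        grad_w += da * hs[t - 1]                 # contribution of this timestep
        from_future = da * w                     # pass the gradient to timestep t-1
    return grad_w

# Hypothetical inputs and targets for five timesteps.
xs = [1.0, 0.5, -0.3, 0.8, 0.2]
ys = [0.2, 0.4, 0.1, 0.5, 0.3]
w, u = 0.8, 0.5

analytic = bptt_grad_w(w, u, xs, ys)
eps = 1e-6
numeric = (loss(w + eps, u, xs, ys) - loss(w - eps, u, xs, ys)) / (2 * eps)
print(analytic, numeric)   # the two values should agree closely
```

The variable `from_future` is the part that makes this different from ordinary backpropagation: at every timestep it carries the gradients of all later timesteps.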
What can you see from the above?
✔ Rather than using only the one or two previous words to predict the next word in a sentence, this approach preserves dependencies across the whole sequence!
If we consider just the previous word, for example ‘a’ in the context above, a neural network has many candidates to predict: a river, a student, a car, etc. (The candidates depend on the words in its corpus.)
But in this RNN-based approach, we have the sequence ‘Anne bought a’ to predict the next word from. So ‘river’ or ‘student’ now has a very low probability of being the next word.
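The effect of a longer context can be sketched without any neural network at all. The snippet below is only a counting illustration over a tiny made-up corpus (it is not an RNN): it lists which words ever follow a given context, first with one word of context and then with three.

```python
from collections import defaultdict

def next_word_options(sentences, context_size):
    """Map each context of `context_size` words to the set of words seen after it."""
    options = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            options[context].add(words[i])
    return options

# Hypothetical toy corpus (illustration only).
corpus = [
    "Anne bought a car",
    "Tom crossed a river",
    "Sam met a student",
    "Anne bought a book",
]

one_word = next_word_options(corpus, 1)
three_words = next_word_options(corpus, 3)

print(sorted(one_word[("a",)]))                      # ['book', 'car', 'river', 'student']
print(sorted(three_words[("Anne", "bought", "a")]))  # ['book', 'car']
```

With more context, the set of plausible next words shrinks, which is the benefit an RNN gets from summarizing the whole sequence in its hidden state.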
This is a significant benefit of using an RNN.
So, what’s wrong? Just recall the theory I explained with the example.
Information decay: Although we say that an RNN remembers previous content, all of its internal computation happens within mathematical bounds. The hidden layer’s output is a vector of fixed size. So when the information exceeds what that vector can hold, the RNN starts forgetting. This happens over long distances. In the given example, at timestep 1 the hidden state had no word to remember, and at timestep 2 it had only one word. But at timestep 5 it had 4 words. If that exceeds the capacity of the output vector, information decays.
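One way to get a feel for this forgetting is to measure how much the first input still influences the final hidden state as the sequence gets longer. The sketch below uses a one-unit RNN with assumed weights (w = 0.5, u = 1.0) and fixed later inputs; it illustrates the fading effect and is not a measurement of any real model.

```python
import math

def final_hidden(first_input, length, w=0.5, u=1.0):
    """One-unit RNN: return the hidden state after `length` timesteps.

    Only the first input varies; every later input is fixed at 0.1.
    """
    h = 0.0
    for t in range(length):
        x = first_input if t == 0 else 0.1
        h = math.tanh(w * h + u * x)
    return h

def first_word_influence(length):
    """How much does changing the first input change the final hidden state?"""
    return abs(final_hidden(1.0, length) - final_hidden(-1.0, length))

for n in (2, 5, 10, 20):
    print(n, first_word_influence(n))
```

The printed influence shrinks as the sequence grows: after enough timesteps, the final hidden state is almost the same whatever the first input was.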
Vanishing gradient: In the backpropagation process, if the gradient of the activation function is a value between 0 and 1 (for example, 0.3), the gradient of the last timestep is repeatedly multiplied by that value. For the sake of understanding, take the gradient of the last timestep as 1 and, at timesteps 4, 3, 2 and 1, multiply it by the gradient of the activation function (0.3): 1 x 0.3 x 0.3 x 0.3 x 0.3 = 0.0081. So you can see that it may approach 0 in a longer sequence. The gradient vanishes over time!
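The same arithmetic can be sketched in a few lines of Python. This keeps the simplification used above (one constant factor of 0.3 per timestep); `backpropagated_gradient` is a hypothetical helper name.

```python
def backpropagated_gradient(last_gradient, factor, steps):
    """Multiply the gradient by `factor` once per timestep and keep the history."""
    grad = last_gradient
    history = [grad]
    for _ in range(steps):
        grad *= factor
        history.append(grad)
    return history

history = backpropagated_gradient(1.0, 0.3, 4)
print(history[-1])          # roughly 0.0081, as in the calculation above

long_history = backpropagated_gradient(1.0, 0.3, 50)
print(long_history[-1])     # far smaller: practically zero after 50 timesteps
```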
Exploding gradient: This is the opposite of the vanishing gradient. Imagine the factor multiplied in at each timestep is 4.75 and the gradient of the last timestep is 1. (A factor above 1 typically comes from the recurrent weights rather than from the activation function alone, since the gradients of sigmoid and tanh never exceed 1.) What happens when backpropagating over 4 timesteps? 1 x 4.75 x 4.75 x 4.75 x 4.75 ≈ 509.07. The gradient grows rapidly over time and may become very large. If the sequence is very long, the gradient may exceed the value range of the data type being used and end up as ‘NaN’, which turns the entire training run into a mess!
Over time, LSTM (Long Short-Term Memory) was introduced to address these pitfalls of RNNs!
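Here is a small sketch of that growth using ordinary Python floats (64-bit), again assuming one constant factor of 4.75 per timestep. It also shows how an overflowed value turns into NaN once it takes part in further arithmetic; `grow_gradient` is a hypothetical helper name.

```python
import math

def grow_gradient(last_gradient, factor, steps):
    """Multiply the gradient by `factor` once per timestep."""
    grad = last_gradient
    for _ in range(steps):
        grad *= factor
    return grad

print(grow_gradient(1.0, 4.75, 4))       # roughly 509.07

overflowed = grow_gradient(1.0, 4.75, 1000)
print(overflowed)                        # inf: beyond the range of a 64-bit float
print(overflowed - overflowed)           # nan: arithmetic on inf is no longer meaningful
print(math.isnan(overflowed - overflowed))
```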
First Published here
*Lead Image by chenspec from Pixabay*