This tutorial is the fourth in a series of tutorials that will help you build an abstractive text summarizer using TensorFlow. Today we will discuss some useful modifications to the core RNN seq2seq model that we covered in the last tutorial.
These modifications are:
This is a series of tutorials that will help you build an abstractive text summarizer using TensorFlow, using multiple approaches. You don't need to download the data or run the code locally on your device, as the data is found on Google Drive (you can simply copy it to your Google Drive, learn more here), and the code for this series is written in Jupyter notebooks that run on Google Colab and can be found here.
We have covered so far (the code for this series can be found here):
0. Overview on the free ecosystem for deep learning (how to use google colab with google drive)
So let's get started.
Our task is text summarization; we call it abstractive because we teach the neural network to generate words, not just copy words.
The data we will use consists of news articles and their headlines. It can be found on my Google Drive, so you can simply copy it to your Google Drive without the need to download it (more on this).
We will represent the data using word embeddings, which simply means converting each word to a specific vector; we will also create a dictionary for our words (more on this).
There are different approaches for this task. They are built over a cornerstone concept, and they keep on developing and building up. They start by working on a type of network called an RNN, arranged in an encoder/decoder architecture called seq2seq (more on this). The code for these different approaches can be found here.
This tutorial is based on the amazing work of Andrew Ng; his course on RNNs has been truly useful, and I recommend you watch it.
Today we will go through some modifications made to the core component of the encoder/decoder model. These modifications occur on the RNN block itself, to increase its efficiency within the whole model.
There are 2 main problems with the RNN unit
This is quite important when dealing with an NLP problem, as some words depend on words that appeared very early in the sentence, like:
Here the word cat/cats, which appeared early in the sentence, directly affects choosing either was/were later in the sentence.
To solve this problem we need a new RNN architecture; here we will discuss 2 main approaches:
Both GRU & LSTM solve the problem of vanishing gradients that the normal RNN unit suffers from. They do it by implementing a memory cell within their network, which enables them to store data from early in the sequence to be used later in the sequence.
Here we will talk about the GRU (gated recurrent unit). We begin with the activation equation of the RNN (more on this):
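As a quick reminder (in the notation of Andrew Ng's course, which this tutorial follows, and where [ , ] denotes concatenating the two vectors), the RNN activation can be written as:

a^{<t>} = g(W_a [a^{<t-1>}, x^{<t>}] + b_a)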
then we would apply some simple modifications to it
till we finally have
Here c denotes the memory cell; in the GRU it is also the output of the cell.
The N subscript denotes that it is the newly proposed c value (we will use it later to generate the real c output of the GRU).
So here the newly proposed output c_N (the candidate) depends on the old output c (the old candidate) and the current input at that time step.
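In equation form, this is roughly (with W_c and b_c being the learnable parameters of the candidate):

c_N^{<t>} = \tanh(W_c [c^{<t-1>}, x^{<t>}] + b_c)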
To decide whether to remember the value of c_N (the candidate), we use another parameter called F (the update gate, written F_u below); this controls whether we update the value of c or not.
Here we use a sigmoid function that takes into consideration the old c and the current input x:
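A sketch of this gate, written F_u to distinguish it from the relevance gate introduced later:

F_u = \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u)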
So to update the value of c we use:
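A sketch of the update rule (* denotes element-wise multiplication): when F_u is 1 we take the new candidate, and when it is 0 we keep the old memory:

c^{<t>} = F_u * c_N^{<t>} + (1 - F_u) * c^{<t-1>}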
Let's assume that c is a vector whose first element remembers important features within the sentence; here we assume this feature is whether the word is cat or cats.
So at first the c vector is empty, until we see the word cat; then the gate is set to 1 so that c memorizes that it is a singular word, and c keeps that value until it is used later in the sentence (to generate 'was', not 'were').
There is just one more modification needed to build our full GRU unit; it occurs in the function used to create the new candidate c.
Here we have another learnable gate (F_r) that learns the relevance between the new candidate c_N and the old c.
So to sum it all up, we have 4 main equations that govern the GRU:
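A sketch of these 4 equations in the notation used above (F_u is the update gate, F_r the relevance gate, and * is element-wise multiplication); in Andrew Ng's course the gates are written \Gamma_u and \Gamma_r:

c_N^{<t>} = \tanh(W_c [F_r * c^{<t-1>}, x^{<t>}] + b_c)
F_u = \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u)
F_r = \sigma(W_r [c^{<t-1>}, x^{<t>}] + b_r)
c^{<t>} = F_u * c_N^{<t>} + (1 - F_u) * c^{<t-1>}

In practice we rarely write these by hand; for example, tf.keras.layers.GRU implements them for us.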
LSTM is another modification to the RNN. It is also built using the same concept of a memory cell, to remember long sequences of data. It was actually proposed before the GRU, so the GRU is a simplification of the LSTM.
Here in LSTM, the activation a and the memory cell c are no longer the same (unlike the GRU, where the memory cell itself was the output).
So to calculate the new candidate:
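A sketch of the LSTM candidate; note that it uses the previous activation a, not the previous memory cell c:

c_N^{<t>} = \tanh(W_c [a^{<t-1>}, x^{<t>}] + b_c)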
Here in LSTM we control the memory cell through 3 different gates:
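A sketch of the 3 gates (update, forget, and output), each a sigmoid of the previous activation and the current input:

F_u = \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u)
F_f = \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f)
F_o = \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o)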
As we said before, we have 2 outputs from the LSTM: the new memory cell c and the new activation a. To compute them we use the gates above.
To combine all of these together:
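A sketch of the combination step (* is element-wise multiplication): the update gate decides how much of the new candidate to write, the forget gate decides how much of the old memory to keep, and the output gate decides how much of the memory appears in the activation:

c^{<t>} = F_u * c_N^{<t>} + F_f * c^{<t-1>}
a^{<t>} = F_o * \tanh(c^{<t>})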
We can also output a prediction ŷ from the LSTM (by passing the activation through a softmax).
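Roughly, with W_y and b_y being the output projection parameters:

\hat{y}^{<t>} = \mathrm{softmax}(W_y a^{<t>} + b_y)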
When we connect multiple LSTMs together, we can see that if the network correctly learns the gate parameters, it can pass the candidate values (red values) from early in the sequence to the very end of the sequence, so we can model long dependencies with high accuracy.
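As a small illustration (this is not the notebook code of this series, just a minimal sketch with placeholder sizes), this is how a chain of LSTM cells over a whole sequence looks in tf.keras:

```python
import tensorflow as tf

# a hypothetical batch: 32 sentences, 50 time steps, 128-dim word embeddings
inputs = tf.random.normal([32, 50, 128])

# an LSTM layer keeps an internal memory cell c and an activation a at every step
lstm = tf.keras.layers.LSTM(
    units=256,              # size of the activation / memory cell vectors
    return_sequences=True,  # return the activation at every time step
    return_state=True)      # also return the final activation a and memory cell c

all_activations, final_a, final_c = lstm(inputs)
print(all_activations.shape)          # (32, 50, 256)
print(final_a.shape, final_c.shape)   # (32, 256) (32, 256)
```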
This is a modification made to the normal RNN network to address an important need in NLP problems,
as in NLP, sometimes to understand a word we need not just the previous words, but also the coming words, like in this example:
Here, to distinguish between the 2 different meanings of the word Teddy (in one it is part of a person's name, while in the other it is part of "teddy bear"), we need to look at the coming word; this is the reason why we need to apply bidirectional networks.
Bidirectional networks are a general architecture that can utilize any RNN model (normal RNN, GRU, LSTM).
Forward propagation for the 2 directions of cells
Here we apply forward propagation 2 times, once for the forward cells and once for the backward cells.
Both activations (forward and backward) are considered when calculating the output ŷ at time t.
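In equation form (a sketch, where the arrows mark the forward and backward activations and W_y, b_y are the output parameters):

\hat{y}^{<t>} = g(W_y [\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y)

In tf.keras, wrapping any recurrent layer so it runs in both directions is done with tf.keras.layers.Bidirectional; a minimal sketch with placeholder sizes:

```python
import tensorflow as tf

# a hypothetical batch: 32 sentences, 50 time steps, 128-dim word embeddings
inputs = tf.random.normal([32, 50, 128])

# wrap any recurrent layer (SimpleRNN, GRU, LSTM) to run it forward and backward;
# by default the two activations are concatenated at every time step
bi_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))

activations = bi_lstm(inputs)
print(activations.shape)  # (32, 50, 512) -> 256 forward + 256 backward
```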
To achieve even greater results, we can stack multiple RNNs (LSTM, GRU, or normal RNN) on top of each other, but we must take into consideration that they operate through time.
So to get started, here is a normal deep network; we can see that it contains multiple layers (50 in this case). When we apply the same concept to RNNs, we tend to choose a much smaller number of layers, as that is usually enough and because more layers would be computationally expensive.
Now let's see how we would apply the concept of deep networks to RNNs.
As we can see, since we are working with RNNs or their variations, we must take the time factor into consideration: each vertical column of cells represents a layer, and with each step in time we repeat this column.
so our notation would be [layer] <time>
To get the value of any activation in this network, we use both the activation of the same layer at the previous time step and the activation of the layer below at the current time step, as sketched below.
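In equation form (a sketch using the [layer] <time> notation above):

a^{[l]<t>} = g(W_a^{[l]} [a^{[l]<t-1>}, a^{[l-1]<t>}] + b_a^{[l]})

And a minimal tf.keras sketch of such a stack (the layer sizes are placeholders); every layer except the last must return its activation at every time step so that the layer above it has an input for each time step:

```python
import tensorflow as tf

# a sketch of a 3-layer deep RNN (LSTM cells here)
deep_rnn = tf.keras.Sequential([
    tf.keras.layers.LSTM(256, return_sequences=True, input_shape=(None, 128)),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(256),  # the last layer returns only its final activation
])

batch = tf.random.normal([32, 50, 128])  # 32 sequences, 50 steps, 128-dim embeddings
print(deep_rnn(batch).shape)             # (32, 256)
```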
Next time, if GOD wills it, we will go through how to enhance our architecture even more using the concepts of
I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials can be found here; you can simply use Google Colab to run it. Please review the tutorial and tell me what you think about it. Hope to see you again.