amr zaki

@theamrzaki

Multilayer Bidirectional LSTM/GRU for text summarization made easy (tutorial 4)

This tutorial is the fourth in a series of tutorials that will help you build an abstractive text summarizer using TensorFlow. Today we discuss some useful modifications to the core RNN seq2seq model covered in the last tutorial.

These modifications are:

  1. RNN modifications (GRU & LSTM)
  2. Bidirectional networks
  3. Multilayer networks

About Series

This is a series of tutorials that will help you build an abstractive text summarizer using TensorFlow with multiple approaches. You don't need to download the data or run the code locally on your device: the data is on Google Drive (you can simply copy it to your own Google Drive, learn more here), and the code for this series is written in Jupyter notebooks that run on Google Colab and can be found here.

So far we have covered (code for this series can be found here):

  0. Overview of the free ecosystem for deep learning (how to use Google Colab with Google Drive)
  1. Overview of the text summarization task and the different techniques for the task
  2. Data used and how it can be represented for our task
  3. What seq2seq is for text summarization, and why

So let's get started.

Quick Recap

Our task is text summarization. We call it abstractive because we teach the neural network to generate words, not just copy them.

The data used is news articles and their headlines. It can be found on my Google Drive, so you can just copy it to your own Google Drive without needing to download it (more on this).

We represent the data using word embeddings, which simply means converting each word to a specific vector. We also create a dictionary for our words (more on this).

There are different approaches to this task. They are all built on one cornerstone concept and keep developing from there: they start from a type of network called an RNN, arranged in an encoder/decoder architecture called seq2seq (more on this). The code for these different approaches can be found here.

This tutorial is based on the amazing work of Andrew Ng. His course on RNNs has been truly useful, and I recommend you watch it.

Today we go through some modifications to the core component of the encoder/decoder model. These modifications are made to the RNN block itself, to make it more effective within the whole model.

1. RNN modifications (LSTM & GRU)

There are 2 main problems with the RNN unit:

  1. Exploding Gradients: this occurs with deep networks (i.e. networks with many layers, as in our case, where the RNN is unrolled over many time steps). When we apply backpropagation, the gradients can grow too large. This problem is actually rather easy to solve using gradient clipping, which simply means setting a threshold: when a gradient exceeds it, we clip the gradient to a fixed value.
  2. Vanishing Gradients: this is a much harder problem to solve. It also occurs because of the large number of layers, and it shows up as the inability of the normal RNN unit to remember values that appeared early in the sequence.
This is quite important when dealing with an NLP problem, as some words depend on words that appeared very early in the sentence.
For example, the word cat/cats appearing early in a sentence directly determines whether was/were is chosen later in the sentence.
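
The gradient clipping idea from item 1 can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code used in the series notebooks: the function names and the threshold values are assumptions made up for the example. It shows clipping by value (what the text describes) and, as a commonly used variant, clipping by the global norm of all gradients.

```python
import numpy as np

def clip_by_value(grads, threshold):
    """Clip every gradient element into [-threshold, threshold]."""
    return [np.clip(g, -threshold, threshold) for g in grads]

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

# Hypothetical gradients for two parameter tensors (illustrative values only).
grads = [np.array([3.0, -4.0]), np.array([12.0])]

clipped_value = clip_by_value(grads, threshold=5.0)      # 12.0 is cut down to 5.0
clipped_norm = clip_by_global_norm(grads, max_norm=1.0)  # all gradients shrink together
```

Clipping by value changes the direction of the overall gradient, while clipping by global norm keeps the direction and only shrinks the step, which is why the second variant is often preferred in practice.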

To solve this problem we need a new RNN architecture. Here we discuss 2 main approaches:

  1. GRU (Gated Recurrent Unit)
  2. LSTM (Long Short-Term Memory)

1.A) GRU (Gated Recurrent Unit)

Both GRU and LSTM address the vanishing gradient problem that the normal RNN unit suffers from. They do it by implementing a memory cell within the unit, which enables them to store information from early in the sequence and use it later in the sequence.

Here we talk about the GRU (Gated Recurrent Unit). We begin with the activation equation of the RNN (more on this), apply some simple modifications to it, and end up with an equation for a new candidate value of the memory cell.

Here c denotes the memory cell; in a GRU it is also the output of the cell.

The subscript N denotes the newly proposed value of c (we will use it later to generate the real c output of the GRU).

So the newly proposed output c (the candidate) depends on the old output c (the previous value) and on the current input at that time step.
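
Written out, the step from the RNN activation to the GRU candidate might look like the following. This is a sketch under assumptions the tutorial does not state explicitly: a tanh activation, the concatenation notation [·, ·], and the weight and bias names W_a, b_a, W_c, b_c.

```latex
\begin{align}
a^{<t>}   &= \tanh\left(W_a\,[a^{<t-1>},\, x^{<t>}] + b_a\right) && \text{(plain RNN activation)} \\
c_N^{<t>} &= \tanh\left(W_c\,[c^{<t-1>},\, x^{<t>}] + b_c\right) && \text{(new candidate for the memory cell)}
\end{align}
```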

To decide whether to keep the value of c, we use another parameter called F (the update gate). It controls whether or not we update the value of c with the candidate.

F is computed with a sigmoid function that takes into account the old c and the current input x, so its value lies between 0 and 1.

To update the value of c, we take F times the new candidate plus (1 - F) times the old c.

Let's assume that c is a vector whose first element remembers an important feature of the sentence; here we assume that this feature is whether the word is cat or cats.

At first the c vector is empty. When we see the word cat, F is set to 1 so that c records that the word is singular. For the following words F stays close to 0, so c keeps its value until it is used later in the sentence (to generate ‘was’ rather than ‘were’).
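
The cat/cats behaviour can be traced with a tiny NumPy sketch of the gated update. The words, candidate values and gate values below are hand-picked for illustration (in a real GRU they are computed by the network), and `gated_update` is a hypothetical helper name.

```python
import numpy as np

def gated_update(c_old, c_new, f):
    """Blend old memory and new candidate: f = 1 overwrites, f = 0 keeps the old value."""
    return f * c_new + (1.0 - f) * c_old

c = np.array([0.0])  # the memory starts empty

# (word, candidate value, gate F): hand-picked, hypothetical values.
steps = [
    ("the",   0.3, 0.0),
    ("cat",   1.0, 1.0),   # F = 1: store "singular" in the memory
    ("which", -0.2, 0.0),  # F = 0: ignore the candidate, keep the memory
    ("ate",   0.7, 0.0),
]

history = []
for word, candidate, f in steps:
    c = gated_update(c, np.array([candidate]), f)
    history.append((word, float(c[0])))
```

After the word "cat" the stored value stays the same for every later word, because a gate of 0 discards each new candidate.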

There is just one more modification needed to build our full GRU unit. It is made to the function that creates the new candidate c.

Here we have another learnable gate (Fr) that learns how relevant the old c is when computing the new c.

So to sum it all up, we have 4 main equations that govern the GRU: the relevance gate Fr, the update gate F, the new candidate c, and the final update of c.
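
Put together, one GRU time step could be sketched in NumPy as below. This follows the common formulation of the GRU rather than the exact code of the series notebooks; the parameter names (`Wf`, `Wr`, `Wc` and their biases), the sizes and the random initialisation are all assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_old, x, params):
    """One GRU time step: two gates, one candidate, one gated update."""
    Wf, bf, Wr, br, Wc, bc = params
    cx = np.concatenate([c_old, x])
    f = sigmoid(Wf @ cx + bf)    # update gate F
    fr = sigmoid(Wr @ cx + br)   # relevance gate Fr
    # The candidate sees the old memory scaled by the relevance gate.
    c_new = np.tanh(Wc @ np.concatenate([fr * c_old, x]) + bc)
    # Final memory (also the output of the cell).
    return f * c_new + (1.0 - f) * c_old

rng = np.random.default_rng(0)
hidden, inputs = 4, 3  # hypothetical sizes

def random_weights():
    return rng.normal(size=(hidden, hidden + inputs)) * 0.1

params = (random_weights(), np.zeros(hidden),
          random_weights(), np.zeros(hidden),
          random_weights(), np.zeros(hidden))

c = np.zeros(hidden)
for x in rng.normal(size=(5, inputs)):  # a sequence of 5 random input vectors
    c = gru_step(c, x, params)
```

Because the final update is a blend of a tanh output and the old memory, every element of `c` stays between -1 and 1.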

1.B) LSTM (Long Short Term Memory)

LSTM is another modification to the RNN. It is also built on the same concept of memory, so that it can remember long sequences of data. It was proposed before GRU, so GRU is actually a simplification of LSTM.

In LSTM:

  1. we use activation values, not just c (candidate values),
  2. we also have 2 outputs from the cell: a new activation and a new candidate value.

So we first calculate the new candidate.

In LSTM we control the memory cell through 3 different gates (update, forget and output).

As we said before, we have 2 outputs from the LSTM, the new candidate and a new activation, and both are computed using the previous gates.

Combining all of these together gives the full LSTM unit.
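
Written out, one common formulation of the LSTM is shown below. The gate names F_u (update), F_f (forget) and F_o (output), the subscript N for the candidate, and the weight and bias names are assumptions chosen to match the notation used for the GRU above; ⊙ denotes element-wise multiplication.

```latex
\begin{align}
c_N^{<t>} &= \tanh\left(W_c\,[a^{<t-1>},\, x^{<t>}] + b_c\right) \\
F_u &= \sigma\left(W_u\,[a^{<t-1>},\, x^{<t>}] + b_u\right) \\
F_f &= \sigma\left(W_f\,[a^{<t-1>},\, x^{<t>}] + b_f\right) \\
F_o &= \sigma\left(W_o\,[a^{<t-1>},\, x^{<t>}] + b_o\right) \\
c^{<t>} &= F_u \odot c_N^{<t>} + F_f \odot c^{<t-1>} \\
a^{<t>} &= F_o \odot \tanh\left(c^{<t>}\right)
\end{align}
```

Unlike the GRU, the old memory is weighted by its own forget gate F_f instead of by (1 - F_u).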

We can also output a prediction y from the LSTM (by passing the activation to a softmax).

When we connect multiple LSTM units together over time, we can see that if the network has learned the gate parameters correctly, it can pass the candidate values (the red values) from early in the sequence all the way to its end, so we can model long-range dependencies.
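
The effect of well-set gates can be checked with a small NumPy sketch of one LSTM step. It uses the common LSTM formulation, and the parameter names, sizes and especially the hand-set biases are assumptions for the demonstration: the update gate is forced close to 0 and the forget gate close to 1, which a trained network would have to learn by itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_old, c_old, x, p):
    """One LSTM time step; returns the new activation and the new memory."""
    ax = np.concatenate([a_old, x])
    c_cand = np.tanh(p["Wc"] @ ax + p["bc"])  # new candidate
    f_u = sigmoid(p["Wu"] @ ax + p["bu"])     # update gate
    f_f = sigmoid(p["Wf"] @ ax + p["bf"])     # forget gate
    f_o = sigmoid(p["Wo"] @ ax + p["bo"])     # output gate
    c = f_u * c_cand + f_f * c_old
    a = f_o * np.tanh(c)
    return a, c

hidden, inputs = 3, 2  # hypothetical sizes
rng = np.random.default_rng(1)
p = {name: rng.normal(size=(hidden, hidden + inputs)) * 0.1
     for name in ("Wc", "Wu", "Wf", "Wo")}
p.update(bc=np.zeros(hidden), bo=np.zeros(hidden),
         bu=np.full(hidden, -20.0),  # update gate ~ 0: do not write new values
         bf=np.full(hidden, 20.0))   # forget gate ~ 1: keep the old memory

a = np.zeros(hidden)
c_start = np.array([1.0, -0.5, 0.25])  # memory written early in the sequence
c = c_start.copy()
for x in rng.normal(size=(50, inputs)):  # 50 further time steps
    a, c = lstm_step(a, c, x, p)
```

After 50 steps `c` is still practically equal to `c_start`, which is the "passing values to the very end of the sequence" behaviour described above.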

2. Bidirectional networks

This is a modification made to the normal RNN network to address an important need in NLP problems.

In NLP, to understand a word we sometimes need not just the previous words but also the words that come after it.

For example, to distinguish between the 2 different meanings of the word teddy (in one sentence it is part of a person's name, in another it is part of ‘teddy bear’), we need to look at the following word. This is why we need bidirectional networks.

Bidirectional networks are a general architecture that can use any RNN unit (normal RNN, GRU, LSTM).

Forward propagation for the 2 directions of cells

Here we apply forward propagation twice: once for the forward cells (reading the sequence from start to end) and once for the backward cells (reading it from end to start).

Both activations (forward and backward) are then used to calculate the output ŷ at time t.
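
The two passes can be sketched in NumPy as follows. For brevity this uses a plain tanh RNN cell and returns raw output scores without a softmax; the function names, sizes and random weights are assumptions for the example, not the code of the series notebooks.

```python
import numpy as np

def rnn_pass(xs, W, b, hidden):
    """Run a plain tanh RNN over xs and return the activation at every step."""
    a = np.zeros(hidden)
    activations = []
    for x in xs:
        a = np.tanh(W @ np.concatenate([a, x]) + b)
        activations.append(a)
    return activations

def bidirectional_outputs(xs, fwd, bwd, Wy, by, hidden):
    a_fwd = rnn_pass(xs, fwd[0], fwd[1], hidden)
    # Backward cells read the sequence reversed; reverse again to re-align with time.
    a_bwd = rnn_pass(xs[::-1], bwd[0], bwd[1], hidden)[::-1]
    # The output at time t uses both the forward and the backward activation.
    return [Wy @ np.concatenate([f, b]) + by for f, b in zip(a_fwd, a_bwd)]

rng = np.random.default_rng(0)
T, inputs, hidden, outputs = 4, 2, 3, 2  # hypothetical sizes
fwd = (rng.normal(size=(hidden, hidden + inputs)) * 0.5, np.zeros(hidden))
bwd = (rng.normal(size=(hidden, hidden + inputs)) * 0.5, np.zeros(hidden))
Wy = rng.normal(size=(outputs, 2 * hidden)) * 0.5
by = np.zeros(outputs)

xs = rng.normal(size=(T, inputs))
ys = bidirectional_outputs(xs, fwd, bwd, Wy, by, hidden)

# Change only the LAST input and look at the FIRST output.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
ys_changed = bidirectional_outputs(xs_changed, fwd, bwd, Wy, by, hidden)
first_output_shift = float(np.max(np.abs(ys[0] - ys_changed[0])))
```

`first_output_shift` is non-zero: the first output reacts to a change in the last input, which a forward-only RNN could not do. This is the "look at the coming word" ability from the teddy example.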

3. Multilayer networks

To achieve even better results, we can stack multiple RNN layers (LSTM, GRU or normal RNN) on top of each other, but we must take into account that they also work across time.

A normal deep network can contain many layers (50 in this example). When we apply the same concept to RNNs we tend to choose a much smaller number of layers, as that is usually enough and more layers would be computationally expensive.

Now let's see how we apply the concept of deep networks to RNNs.

Since we are working with RNNs or their variations, we must take the time factor into account: each vertical column of cells represents the stack of layers at one time step, and we repeat this column for every step in time.

So our notation is [layer] <time>.

To get the value of any activation, for example the one in layer 2 at time 3, we use both:

  1. the previous activation in time (time 2) from the same layer (layer 2) 💚 green
  2. the activation at the same time (time 3) from the previous layer (layer 1) 🔵 blue
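
The two dependencies listed above can be sketched as a stacked RNN in NumPy. The sketch uses plain tanh cells, and the function name, sizes and random weights are assumptions for the example. Note that the code counts from 0, so `a[1][2]` in the code is layer 2 at time 3 in the notation of the text.

```python
import numpy as np

def deep_rnn(xs, layers, hidden):
    """layers is a list of (W, b); returns a[layer][time] for every cell."""
    T = len(xs)
    a = [[None] * T for _ in layers]
    for t in range(T):
        below = xs[t]  # the first layer reads the input itself
        for l, (W, b) in enumerate(layers):
            # Same layer, previous time step (zeros at the first step).
            prev_time = a[l][t - 1] if t > 0 else np.zeros(hidden)
            # Same time step, previous layer (or the input for the first layer).
            a[l][t] = np.tanh(W @ np.concatenate([prev_time, below]) + b)
            below = a[l][t]
    return a

rng = np.random.default_rng(0)
T, inputs, hidden = 5, 3, 4  # hypothetical sizes
layers = [
    (rng.normal(size=(hidden, hidden + inputs)) * 0.3, np.zeros(hidden)),  # layer 1
    (rng.normal(size=(hidden, hidden + hidden)) * 0.3, np.zeros(hidden)),  # layer 2
]
xs = rng.normal(size=(T, inputs))
a = deep_rnn(xs, layers, hidden)
```

Only the first layer has a weight matrix sized for the raw input; every higher layer takes the activation of the layer below it as its input.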

Next time, if GOD wills it, we will go through how to enhance our architecture even more using the concepts of:

  1. Beam Search
  2. Attention Model

I truly hope you have enjoyed reading this tutorial and that I have made these concepts clear. All the code for this series of tutorials can be found here, and you can simply use Google Colab to run it. Please review the tutorial and tell me what you think about it. Hope to see you again!
