Beam Search & Attention for text summarization made easy (Tutorial 5) by@theamrzaki
3,598 reads

Beam Search & Attention for text summarization made easy (Tutorial 5)

Read on Terminal Reader

Too Long; Didn't Read

Company Mentioned

Mention Thumbnail
featured image - Beam Search & Attention for text summarization made easy (Tutorial 5)
amr zaki HackerNoon profile picture


amr zaki
react to story with heart

This tutorial is the fifth one from a series of tutorials that would help you build an abstractive text summarizer using tensorflow , today we would discuss some useful modification to the core RNN seq2seq model we have covered in the last tutorial

These Modifications are

  1. Beam Search
  2. Attention Model

About Series

This is a series of tutorials that would help you build an abstractive text summarizer using tensorflow using multiple approaches , you don’t need to download the data nor do you need to run the code locally on your device , as data is found on google drive , (you can simply copy it to your google drive , learn more here) , and the code for this series is written in Jupyter notebooks to run on google colab can be found here

We have covered so far (code for this series can be found here)

0. Overview on the free ecosystem for deep learning (how to use google colab with google drive)

  1. Overview on the text summarization task and the different techniques for the task
  2. Data used and how it could be represented for our task
  3. What is seq2seq for text summarization and why
  4. Mulitlayer Bidirectional LSTM/GRU

so lets get started

Quick Recap

Our task is of text summarization , we call it abstractive as we teach the neural network to generate words not to merely copy words .

the data that would be used would be news and their headers , it can be found on my google drive , so you just copy it to your google drive without the need to download it (more on this)

We would represent the data using word embeddings , which is simply converting each word to a specific vector , we would create a dictionary for our words (more on this)

There are different approaches for this task , they are built over a corner stone concept , and they keep on developing and building up , they start by working on a type of network called RNN , which is arranged in an Encoder/Decoder architecture called seq2seq (more on this), then we would build the seq2seq in a mulitlayer bidirectional structure , where the rnn cell would be a LSTM cell (more on this) , the code for these different approaches can be found here

This tutorial has been based by the amazing work of Andrew NG , his course on RNN has been truly useful , i recommend you to see it

Today we would go through some modifications made to the core component of the encoder/decoder model , these modifications helps the network choose the best results from a pool of different possibilities , which is known as Beam search , also we would talk about Attention Model , which is a simple network added to our architecture to help it pay more attention to specific words to help it output better summarizes .

So lets get started !!

1. Beam Search

1.A Intuition (why beam search)

Our task of text summarization can be seen as a conditional language model , meaning that we generate an output given an input sentence , so the output is conditioned on the input sentence , so this is why it is called conditional .


This can be seen as a contrary to a simple language model , as a normal language model only outputs the probability of a certain sentence , this can be used to generate novel sentences ,( if its input can be zero) , but in our case , it would select the most likely output given an input sentence.

So our architecture is actually divided into 2 parts , an encoder , and a decoder (like discussed here) , as the encoder would represent the input sentence as a vector , and pass it to the next part of the architecture (decoder)


But a problem would arise in the decoder , which sentence to select , as there could be a huge number of outputs for a certain input sentence


so how to choose the most likely sentence from this large pool of possible outputs ?

  1. Jane is visiting Africa in September
  2. Jane is going to be visiting Africa in September
  3. In September Jane will visit Africa

One solution could be by generating 1 word at a time , generating the first most likely word , then generating the next most word and so … , this is called greedy search , but this tends not to be the most optimized approach , due to 2 main reasons

  1. in practice , greedy search doesn’t perform well
  2. in each step we must take into consideration all possibilities of all words in the dict , if you have 10k words , then in each step you would have to consider 10k^ 10k (10k to the power 10k)

so we would go to a more optimized approximate search approach → Beam Search

1.B How to Beam Search

Now that we have understood the basic intuition behind beam search , how to actually implement it ?

Simply beam search differs from basic greedy search by considering multiple options in every step , not just 1 option , these number of options is controlled by a variable called Beam Width

In greedy search :


in each step we only consider the most likely output at each step , but this not always guarantee the best solution as


greedy search would output that the best solution is Jane is going , while Jane is visiting is better output

this is because the probability of the word “going” after “is” is bigger than the word “visiting” after is , so this is why we need to consider another approach not greedy search , an approach that would take multiple words into consideration

So in Beam Search it would be (Beam Width=3)


in step 1 , we would choose the 3 most likely words , then for each of the 3 selected words , we would get the most 3 likely words , then the same logic would occur on these 3 output sentences till we build our best output.

So if our vocab is 10k , for the each step we tried to look for the most likely word from 10k vocab , so by this we only considered 30k words

1.C Effect of Beam Width

when Beam Width = 3 → consider 3 words each step

when Beam Width = 1 → consider 1 word each step → greedy search

As Beam width increases → better results → but would require more computational resources

2. Attention Model

2.A Attention Intuition

When we as humans summarize text , we actually look at couple of words at a time , not the whole text to summarize at a given instance , this is what we are trying to teach our model .

We try to teach our model to only pay attention to the neighboring words not the whole text , but what is the function of this attention , we actually don’t know !! , so this is why we would construct a simple neural network for this exact task .

2.B Attention Structure

Here we would work on our seq2seq encoder decoder structure (more on this) , we would work on a bidirectional encoder (more on this) , our work would actually occur on a new interface between the encoder and the decoder , this new interface is called context vector , which actually represent the amount of attention given to words


Encoder would be



So here to calculate the context C1 that would be the input to the first word in the decoder , we would calculate multiple alpha parameters , so now we would need to know how to get these alpha parameters

2.C Calculate Attention Parameters (context and alpha)

Now that we understood the interface between the encoder and the decoder through the context vector which is actually calculated through attention alpha parameters , we would need to know more about how this is calculated

First : lets go through how to calculate the context vector

don’t forget that we are working on bidirectional encoder , so the activation is actually divided into 2 parts left & right (more on this)



so the attention would be

so the attention have 2 parts , a left and right to form the activation parameter of each cell


then the context would be weighted sum of the activations (t stands for number of cells in the encoder ) where the weights are the alpha parameters , this would occur for each context vector

Second : We need to know how to calculate the alpha itself

Alpha is


t stands for the time step in the output , while t’ stands for the time step in input

Sum of all alphas is 1

there is a formula that ensures that the sum is 1


then we just need to know what is ‘e’

by intuition we can say that the attention actually depends on

  1. the current activation of the input ( alpha<t’>)
  2. the previous state of the last time step in the decoder (s<t-1>)

but we actually don’t know the real function between them , so we simply build a simple neural network to learn this relation for us. , the output from this network would be ‘e’ parameter that would be used to calculate the alpha attention parameter


so the whole scenario would be


Next Time if GOD wills it , we would go through how to truly implement a text summarization model , we would go through the steps to implement this model which uses implements a multilayer bidirectional LSTM with beam search and attention in tensor flow , the code comes from dongjun-Lee , i have built it into a google colab notebook , and hosted the data to google drive , so that you won’t need to download the data nor would you need to run the code on your computer , you simply run on google colab and connect google colab to your google drive .

I truly hope you have enjoyed reading this tutorial , and i hope i have made these concepts clear , all the code for this series of tutorials are found here , you can simply use google colab to run it , please review the tutorial and tell me what do you think about it , hope to see you again.


. . . comments & more!
Hackernoon hq - po box 2206, edwards, colorado 81632, usa