amr zaki

@theamrzaki

Abstractive Text Summarization (Tutorial 2): Text Representation Made Very Easy

This story continues the series on how to easily build an abstractive text summarizer (check out the github repo for this series). Today we go through how to build a summarizer that is able to understand words, so we start by representing words for our summarizer.

My goal in this series is to present the latest novel approaches to abstractive text summarization in a simple way (you can check my overview blog), from

  1. the cornerstone method of using seq2seq models with attention
  2. to using a pointer generator
  3. to using reinforcement learning with deep learning

We will use google colab, so you won’t need a powerful computer, nor will you have to download data to your device, as we will connect google drive to google colab for a fully integrated deep learning experience (you can check my overview on working with free deep learning ecosystem platforms)

All code can be found online through my github repo

This tutorial is based on the work of https://github.com/dongjun-Lee/text-summarization-tensorflow . They have truly done great work on simplifying what is needed to apply summarization using tensorflow. I have built on their code, converting it to a python notebook that works on google colab. I truly admire their work.

So let's begin!

1- Setup

1-A To begin, we first create a google colab notebook

1- go to https://colab.research.google.com

2- select the Google Drive tab (to save your new google colab notebook to google drive)

3- select New Python 3 Notebook (you can also select a python 2 notebook)

A blank notebook will be created in your google drive.

You can change the runtime of your notebook by selecting the runtime button in the top menu, to

  1. change which python version you are using
  2. choose a hardware accelerator (GPU or TPU)

1-A-A Or you can clone the code directly from my github repo

  1. go to https://colab.research.google.com , but this time select the github tab
  2. then just paste this link, and click upload

1-B Now that we have created our google colab notebook, let's connect it to google drive

In the newly created notebook, add a new code cell

then paste this code in it

#https://stackoverflow.com/questions/47744131/colaboratory-can-i-access-to-my-google-drive-folder-and-file
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
!mkdir -p drive
!google-drive-ocamlfuse drive

This connects to your drive and creates a folder named drive, through which your notebook can access your google drive.

It will ask you for access to your drive: just click on the link and copy the access token. It will ask for this twice.

After pasting this code, run it by selecting the cell and pressing shift + enter, or by clicking the play button at the top of your code cell.

then you can simply access any file by its path in form of

path = "drive/test.txt"
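Newer Colab runtimes also ship a built-in helper for mounting Drive, which may be simpler than the ocamlfuse route above. The following is only a minimal sketch under that assumption: the function name mount_drive_if_colab is made up for this example, and outside Colab it simply returns None. Note that with this helper your files live under /content/drive/MyDrive/ rather than drive/, so any path used in this tutorial would need that prefix instead.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab.

    Returns the mount point on success, or None when not running in Colab.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        return None
    drive.mount(mount_point)  # opens the usual authorization prompt
    return mount_point


mounted_at = mount_drive_if_colab()
print(mounted_at)
```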

1-C Now let's get the data that we will work on

Our dataset is in the form of news articles and their headlines.

The input is the news content, and the needed output is its summary, which in this case is the headline.

There are 2 popular datasets for this task

  1. Amazon Product Reviews
  2. CNN / Daily Mail news dataset (which we would use in our case)

You don’t have to download the data; you can just copy it to your google drive, which takes only a few seconds.

Here is the Link for the folder containing the data .

Here we use Copy, URL to Google Drive, which enables you to easily copy files between different google drives

  1. first, paste the above link, name it, and choose to save to google drive
  2. then click on Save, Copy to Google Drive (after authenticating your google drive)
  3. after authenticating, just click save to google drive

Now that the setup process is done, we can start our work, so let's begin!

2- Dependencies and paths

2-a First let's install the needed dependencies

In google colab you can install packages using pip, simply by writing !pip.

For every code section, simply add a new code cell and then start writing your code in it

!pip install gensim
!pip install wget

import nltk
nltk.download('punkt')

2-b Then let's import the needed dependencies

from nltk.tokenize import word_tokenize
import re
import collections
import pickle
import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

2-c Then lets define where the data can be found

#default path for the folder inside google drive
default_path = "drive/Colab Notebooks/Model 2/"
#path for training text (article)
train_article_path = default_path + "sumdata/train/train.article.txt"
#path for training text output (headline)
train_title_path = default_path + "sumdata/train/train.title.txt"
#path for validation text (article)
valid_article_path = default_path + "sumdata/train/valid.article.filter.txt"
#path for validation text output(headline)
valid_title_path = default_path + "sumdata/train/valid.title.filter.txt"
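Since these paths depend on where you copied the data in your own drive, it can help to check them before running anything heavier. A minimal sketch, assuming the same folder layout as above (the helper name missing_files is invented for this example):

```python
import os

# same layout as in the tutorial; adjust default_path to your own drive
default_path = "drive/Colab Notebooks/Model 2/"
data_paths = [
    default_path + "sumdata/train/train.article.txt",
    default_path + "sumdata/train/train.title.txt",
    default_path + "sumdata/train/valid.article.filter.txt",
    default_path + "sumdata/train/valid.title.filter.txt",
]


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


missing = missing_files(data_paths)
if missing:
    print("Missing files:", missing)
else:
    print("All data files found")
```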

3- Building A Dictionary

for the text summarization to work , you must represent your words in a dictionary format

assume we have an article like

five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

each word would have a representation in a dict, meaning that each word maps to a unique integer index

and we would also need the reverse operation, mapping each index back to its word
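Both mappings can be sketched in a few lines. The indices here are invented for illustration (they simply follow word order in the sentence); the real ones come from the dictionary-building function later in this tutorial.

```python
sentence = "five-time world champion michelle kwan withdrew"

# word -> index (toy indices, assigned in order of appearance)
word_dict = {word: index for index, word in enumerate(sentence.split())}

# index -> word, the reverse operation
reversed_dict = {index: word for word, index in word_dict.items()}

print(word_dict["champion"])   # 2
print(reversed_dict[2])        # champion
```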

to apply this we would need some helper functions , like

3-A Simple cleaning data function

the goal of this function would be a simple cleaning of data , just by replacing some unneeded characters with #

def clean_str(sentence):
    sentence = re.sub("[#.]+", "#", sentence)
    return sentence

this substitution of characters is rather simple , you can of course add multiple substitution steps
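As one possible illustration of chaining several substitution steps, here is a sketch of an extended cleaner. The extra steps (collapsing whitespace and lowercasing) and the name clean_str_extended are my own assumptions for the example, not part of the original code.

```python
import re


def clean_str_extended(sentence):
    # original step: collapse runs of '#' and '.' into a single '#'
    sentence = re.sub("[#.]+", "#", sentence)
    # assumed extra step: collapse any run of whitespace into one space
    sentence = re.sub(r"\s+", " ", sentence)
    # assumed extra step: trim and lowercase
    return sentence.strip().lower()


print(clean_str_extended("Kwan  won ## medals in 2006 ..."))
# kwan won # medals in 2006 #
```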

3-B Function that actually returns the text

and applies the above cleaning function

def get_text_list(data_path, toy):
    with open(data_path, "r", encoding="utf-8") as f:
        if not toy:
            return [clean_str(x.strip()) for x in f.readlines()][:200000]
        else:
            return [clean_str(x.strip()) for x in f.readlines()][:50]

this function would be called for multiple cases

  1. if you need to load training data
  2. or test data
  3. or if you just need a sample of any of the above by simply setting toy = True
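The function above reads and cleans the whole file before slicing. A possible variant, sketched here under my own naming (get_text_list_lazy and its toy_size / full_size parameters are hypothetical), stops reading as soon as it has enough lines, which matters mostly when toy = True on a large file. The demo writes a tiny temporary file so that it runs anywhere.

```python
import itertools
import os
import re
import tempfile


def clean_str(sentence):
    return re.sub("[#.]+", "#", sentence)


def get_text_list_lazy(data_path, toy, toy_size=50, full_size=200000):
    """Like get_text_list, but only reads as many lines as it returns."""
    limit = toy_size if toy else full_size
    with open(data_path, "r", encoding="utf-8") as f:
        return [clean_str(line.strip()) for line in itertools.islice(f, limit)]


# demo on a throwaway file with three "articles"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first article ...\nsecond article ##\nthird article\n")
    demo_path = tmp.name

sample = get_text_list_lazy(demo_path, toy=True, toy_size=2)
os.remove(demo_path)
print(sample)  # ['first article #', 'second article #']
```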

3-C Now lets Build the function that would actually create the needed dictionary

here you would see that we add 4 built-in words, these are essential for the seq2seq algorithm, they are

  1. <padding> this would be used to make the sequences of same length
  2. <unk> this would be used to identify that the word is not found inside the dict
  3. <s> this would be used to identify the beginning of a sentence
  4. </s> this would be used to identify the end of a sentence
the same code can also be copied from the github repo

def build_dict(step, toy=False):
    if step == "train":
        #First lets load the training data
        train_article_list = get_text_list(train_article_path, toy)
        train_title_list = get_text_list(train_title_path, toy)
        #then lets collect all words from the training data
        #by simply tokenizing each text sample to its words
        #here we would use the built-in function imported from nltk toolkit
        #which simply returns a list of words from a sentence
        words = list()
        for sentence in train_article_list + train_title_list:
            for word in word_tokenize(sentence):
                words.append(word)
        #count the words and sort them from most common to least common
        #(most_common() without an argument keeps all the words)
        word_counter = collections.Counter(words).most_common()
        #first lets set the 4 built-in words
        word_dict = dict()
        word_dict["<padding>"] = 0
        word_dict["<unk>"] = 1
        word_dict["<s>"] = 2
        word_dict["</s>"] = 3
        #then lets build our dict , by simply looping over word_counter
        for word, _ in word_counter:
            word_dict[word] = len(word_dict)
        #then lets save this to a pickle
        with open(default_path + "word_dict.pickle", "wb") as f:
            pickle.dump(word_dict, f)
    #all of the above was for the training step
    #when you are in the validation you can simply load the pickle that
    #you have just saved
    elif step == "valid":
        with open(default_path + "word_dict.pickle", "rb") as f:
            word_dict = pickle.load(f)
    #for both of the 2 cases (training , or validation)
    #we would create a reversed dict
    reversed_dict = dict(zip(word_dict.values(), word_dict.keys()))
    #then we would simply for the 2 cases (training , or validation)
    #define a max len for article and for the summary
    article_max_len = 50
    summary_max_len = 15
    return word_dict, reversed_dict, article_max_len, summary_max_len
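To see the indexing scheme in isolation, here is a stripped-down sketch that runs on three made-up sentences. It is not the function above: build_toy_dict is an invented name, it splits on whitespace instead of using nltk, and it skips the pickle step.

```python
import collections


def build_toy_dict(sentences):
    # whitespace split stands in for nltk's word_tokenize in this sketch
    words = [word for sentence in sentences for word in sentence.split()]
    # the 4 built-in tokens always take indexes 0..3
    word_dict = {"<padding>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
    # remaining words get the next free index, most frequent first
    for word, _ in collections.Counter(words).most_common():
        word_dict[word] = len(word_dict)
    return word_dict


toy_dict = build_toy_dict(["the cat sat", "the dog sat", "the end"])
print(toy_dict["the"], toy_dict["sat"])  # 4 5
print(len(toy_dict))                     # 9
```

Because frequent words come first, "the" (3 occurrences) gets index 4 and "sat" (2 occurrences) gets index 5, right after the built-in tokens.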

4- Now Lets Build Our Dataset

After building the dict for our data, we begin building the actual dataset that will be used by our algorithm

Using the above example of an article,

five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

the algorithm needs this to be represented as a list of indexes, one per word,

which is obtained by simply looking up each word of the given sentence in word_dict

same would occur on the test data
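The lookup, truncation and padding steps can be sketched on a tiny hand-made dict. The vocabulary and the helper name encode are assumptions for this example, and whitespace splitting stands in for nltk tokenization.

```python
# toy vocabulary, invented for illustration
word_dict = {"<padding>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
             "world": 4, "champion": 5, "michelle": 6, "kwan": 7}
reversed_dict = {index: word for word, index in word_dict.items()}


def encode(sentence, word_dict, max_len):
    # look up each word, falling back to <unk> for unknown words
    ids = [word_dict.get(w, word_dict["<unk>"]) for w in sentence.split()]
    # cut to max_len, then pad with <padding> up to max_len
    ids = ids[:max_len]
    return ids + [word_dict["<padding>"]] * (max_len - len(ids))


ids = encode("world champion michelle kwan withdrew", word_dict, max_len=8)
print(ids)  # [4, 5, 6, 7, 1, 0, 0, 0]

# decoding with the reversed dict shows what the model will actually see
print([reversed_dict[i] for i in ids])
```

Note how "withdrew", which is not in the toy vocabulary, becomes index 1 (<unk>), and the sequence is filled with index 0 (<padding>) up to the fixed length.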

def build_dataset(step, word_dict, article_max_len, summary_max_len, toy=False):
    #---case of train
    #---we would load both (article , headline) for training
    if step == "train":
        article_list = get_text_list(train_article_path, toy)
        title_list = get_text_list(train_title_path, toy)
    #---case of valid
    #---we only load articles
    elif step == "valid":
        article_list = get_text_list(valid_article_path, toy)
    #---if step is neither (train nor valid) raise error
    else:
        raise NotImplementedError

    #---(for each article) get list of words
    #--- so now x (article) contains list of words
    x = [word_tokenize(d) for d in article_list]
    #---(for each article) get index of word from word_dict
    #---if not found , use "<unk>" token
    #---so now we have our train dataset
    x = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in x]
    #---(for each article) limit x to article_max_len
    x = [d[:article_max_len] for d in x]
    #---(for each article) if x was less than article_max_len
    #--- pad the x by using "<padding>" token
    x = [d + (article_max_len - len(d)) * [word_dict["<padding>"]] for d in x]

    if step == "valid":
        return x
    else:
        #-------if step = "train"
        #-------we must do the same steps on headline
        #-------but here we don't use the concept of padding
        y = [word_tokenize(d) for d in title_list]
        y = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in y]
        y = [d[:(summary_max_len - 1)] for d in y]
        return x, y

so lets simply call both (build dict and build dataset)

print("Building dictionary...")
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)
print("Loading training dataset...")
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)

5- Word Embeddings

But we can’t yet feed our neural network with a list containing the indexes of words, as it would not understand them.

We need to represent the word itself in a format that our neural net would understand, and here comes the concept of word embeddings

It is a simple concept that replaces each word in your dict with a list of numbers (in our case we model each word with a list of 300 floats)
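In practice this replacement is a row lookup in a matrix with one row per word in the dict. A minimal sketch with toy sizes and random values (the real matrix would have 300 columns and come from pretrained vectors; all names here are made up for the example):

```python
import numpy as np

vocab_size, embedding_size = 6, 4   # toy sizes; the tutorial uses 300 floats
rng = np.random.default_rng(0)

# one row of floats per word index (random here, pretrained in practice)
embedding_matrix = rng.normal(size=(vocab_size, embedding_size)).astype(np.float32)

# a sentence already converted to word indexes
sentence_ids = [4, 5, 1, 0]

# indexing with a list of ids picks the matching rows, in order
sentence_vectors = embedding_matrix[sentence_ids]
print(sentence_vectors.shape)  # (4, 4)
```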

There are already pretrained models that have been trained over millions of texts to correctly model the words. Once the words are modeled well, your neural net is in a much better position to capture the meaning of the text within the article.

A very well known test of how well word embeddings capture meaning is to apply word similarity to a given word, which returns the words whose vectors are closest to it

When that output lists related words, it tells us that the model is now capable of capturing the relations between words, which is an extremely important factor in the success of our neural net
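Word similarity is usually measured with cosine similarity between vectors. The sketch below uses three hand-made 3-number vectors purely for illustration; real embeddings have 300 numbers and come from a trained model, and the function names are my own.

```python
import math

# invented toy vectors, chosen so that king and queen point the same way
toy_vectors = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.75, 0.20],
    "banana": [0.10, 0.20, 0.90],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to the given word."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("king", toy_vectors)
print(ranking[0][0])  # queen
```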

5-A let's get the trained model for our work

There is a very well known set of pretrained vectors called GloVe, provided by Stanford; you can download it from https://nlp.stanford.edu/projects/glove/

or you can simply copy it from my google drive like I have explained before, here is the link for the glove vectors in a pickle format
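If you download the raw vectors from Stanford instead of the pickle, they come as a plain text file with one word per line followed by its numbers. A minimal sketch of reading that format into a plain dict follows; load_glove_text and the two-line fake file are invented for the example. This tutorial loads a pickled model object instead, so a dict like this would be queried as vectors[word] rather than through the model's own lookup method.

```python
import io


def load_glove_text(handle):
    """Parse GloVe's text format: 'word v1 v2 ... vn' on each line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# stand-in for an opened glove .txt file (real files have 300 numbers per line)
fake_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove_text(fake_file)
print(vectors["cat"])  # [0.4, 0.5, 0.6]
```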

5-B Build a function to get an array of word embeddings

def get_init_embedding(reversed_dict, embedding_size):
    print("Loading Glove vectors...")
    #---Load glove model which is in a pickle format
    with open(default_path + "glove/model_glove_300.pkl", 'rb') as handle:
        word_vectors = pickle.load(handle)

    #---Loop through all words within the reversed_dict
    used_words = 0
    word_vec_list = list()
    for _, word in sorted(reversed_dict.items()):
        try:
            #-----------if the word is found in glove ,
            #-----------save its value
            word_vec = word_vectors.word_vec(word)
            used_words += 1
        except KeyError:
            #-----------else , generate an array of zeros
            #-----------of length = embedding_size
            #-----------which in this case would be 300
            #-----------this is also the case for <padding> and <unk>
            #-----------(<s> and </s> are overwritten with random
            #-----------vectors below , whatever they get here)
            word_vec = np.zeros([embedding_size], dtype=np.float32) #to generate for <padding> and <unk>
        #-------add it to the array
        #-------remember that we are looping in sorted reversed_dict
        #-------so the index of the element inside word_vec_list
        #-------would be the same as index of word
        #-------no need of a dict , an array is sufficient
        word_vec_list.append(word_vec)

    #---just print out the percentage of known words
    print("words found in glove percentage = " + str((used_words/len(word_vec_list))*100) )

    #----Assign random vector to <s>, </s> token
    word_vec_list[2] = np.random.normal(0, 1, embedding_size)
    word_vec_list[3] = np.random.normal(0, 1, embedding_size)
    #----then return the array
    return np.array(word_vec_list)

to call the function we simply call

word_embedding = get_init_embedding(reversed_dict, 300)

To sum it all up

So we can say that we have now correctly represented the text for our task of text summarization

so to sum it all up, we have built the code to build the dictionary, build the dataset, and load the word embeddings

by simply calling

word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)
word_embedding = get_init_embedding(reversed_dict, 300)

In the coming tutorial, if GOD wills it, we will go through how to build the model itself. We will build a seq2seq encoder decoder model using LSTM, going through the very details of building such a model using tensorflow. This will be the cornerstone for the next tutorials in the series, which go through the latest approaches for this problem, from

  1. using pointer generator model
  2. using reinforcement learning with deep learning

Don’t forget to clone the code for this tutorial from my repo

and you can take a look at the previous tutorial, which gives an overview of text summarization

you can also check this blog about the ecosystem of a free deep learning platform

I truly hope you have enjoyed this tutorial. I am waiting for your feedback, and I am waiting for you in the next tutorial, if GOD wills it
