Abstractive Text Summarization (Tutorial 2): Text Representation Made Very Easy

This story continues the series on how to easily build an abstractive text summarizer (check out the GitHub repo for this series). Today we go through how to build a summarizer that is able to understand words, so we will cover how to represent words for our summarizer.

My goal in this series is to present the latest novel approaches to abstractive text summarization in a simple way (you can check my overview blog), from

the cornerstone method of using seq2seq models with attention
to using pointer generator
to using reinforcement learning with deep learning

We will use Google Colab, so you won't need a powerful computer, nor will you have to download data to your device, as we will connect Google Drive to Google Colab for a fully integrated deep learning experience (you can check my overview on working with free deep learning ecosystem platforms).

All code can be found online through my github repo

This tutorial is based on the work of https://github.com/dongjun-Lee/text-summarization-tensorflow. They have done great work simplifying what is needed to apply summarization using TensorFlow. I have built on their code and converted it to a Python notebook that works on Google Colab. I truly admire their work.

So let's begin!

1- Setup

1-A To begin, we first create a Google Colab notebook

1- go to https://colab.research.google.com

2- select the Google Drive tab (to save your new Colab notebook to Google Drive)

3- select New Python 3 Notebook (you can also select a Python 2 notebook)

A blank notebook will be created in your Google Drive.

You can change the runtime of your notebook by selecting the Runtime button in the top menu, to

change which Python version you are using
choose a hardware accelerator (GPU or TPU)

1-A-A Or you can clone the code directly from my GitHub repo

go to https://colab.research.google.com, but this time select the GitHub tab
then just paste the link to the repo and click upload

1-B Now that we have created our Google Colab notebook, let's connect it to Google Drive

In the newly created notebook, add a new code cell

then paste this code in it

#https://stackoverflow.com/questions/47744131/colaboratory-can-i-access-to-my-google-drive-folder-and-file

!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive

This connects to your drive and creates a folder through which your notebook can access your Google Drive.

It will ask you for access to your drive: just click on the link and copy the access token. It will ask for this twice.

After pasting this code, run it by selecting the cell and pressing Shift+Enter, or by clicking the play button at the top of the code cell.

Then you can simply access any file by its path, in the form of

path = "drive/test.txt"

1-C Now let's get the data that we will work on

Our dataset is in the form of news articles and their headlines.

The input is the news content, and the required output is its summary, which in this case is the headline.

There are 2 popular datasets for this task

Amazon Product Reviews
CNN / Daily Mail news dataset (which we would use in our case)

You don't have to download the data; you can just copy it to your Google Drive, which takes only a few seconds.

Here is the link for the folder containing the data.

Here we use Copy, URL to Google Drive, which lets you easily copy files between different Google Drives.

First, paste the above link, name it, then save to Google Drive.

Then simply click on Save, Copy to Google Drive (after authenticating your Google Drive).

Now that the setup is done, we can start our work, so let's begin!

2- Dependencies and paths

2-a First let's install the needed dependencies

In Google Colab you can install packages with pip by simply prefixing the command with !

For every code section, add a new code cell and then just start writing your code

!pip install gensim
!pip install wget

import nltk
nltk.download('punkt')

2-b Then let's import the needed dependencies

from nltk.tokenize import word_tokenize
import re
import collections
import pickle
import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

2-c Then let's define where the data can be found

#default path for the folder inside google drive
default_path = "drive/Colab Notebooks/Model 2/"

#path for training text (article)
train_article_path = default_path + "sumdata/train/train.article.txt"

#path for training text output (headline)
train_title_path = default_path + "sumdata/train/train.title.txt"

#path for validation text (article)
valid_article_path = default_path + "sumdata/train/valid.article.filter.txt"

#path for validation text output (headline)
valid_title_path = default_path + "sumdata/train/valid.title.filter.txt"

3- Building A Dictionary

For text summarization to work, you must represent your words in a dictionary format.

Assume we have an article like

five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

Each word gets an integer index in a dict (word to index).

We also need the reverse operation (index to word), so that the indexes the model outputs can be turned back into words.
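As a rough illustration of the two mappings, here is a minimal sketch on a shortened version of the sentence above. It splits on whitespace instead of using the nltk tokenizer, and the variable names are only for this example.

```python
from collections import Counter

sentence = ("michelle kwan withdrew from the # us figure skating championships , "
            "but will petition us skating officials")
tokens = sentence.split()

# Count the words; most_common() sorts them from most to least frequent
counts = Counter(tokens).most_common()

# word -> index (more frequent words get smaller indexes)
word_dict = {word: index for index, (word, _) in enumerate(counts)}

# index -> word (the reverse operation)
reversed_dict = {index: word for word, index in word_dict.items()}

ids = [word_dict[w] for w in tokens]
decoded = " ".join(reversed_dict[i] for i in ids)

print(ids)
print(decoded == sentence)
```

Going from words to indexes and back gives the original sentence again, which is exactly what the reversed dict is for.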

To apply this we need some helper functions:

3-A Simple data cleaning function

The goal of this function is a simple cleaning of the data, just by replacing runs of some unneeded characters (here # and .) with a single #

def clean_str(sentence):
    sentence = re.sub("[#.]+", "#", sentence)
    return sentence

This substitution of characters is rather simple; you can of course add more substitution steps.
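For example, here is a quick check of what clean_str does, plus one extra substitution step of the kind mentioned above. The second function, clean_str_extended, and its whitespace rule are hypothetical and not part of the original code.

```python
import re

def clean_str(sentence):
    # Collapse any run of '#' and '.' characters into a single '#'
    return re.sub("[#.]+", "#", sentence)

def clean_str_extended(sentence):
    # Hypothetical extra step: also squeeze repeated whitespace
    sentence = clean_str(sentence)
    return re.sub(r"\s+", " ", sentence).strip()

first = clean_str("the ## us figure skating championships ...")
second = clean_str_extended("the  ##  us   figure skating")
print(first)
print(second)
```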

3-B Function that actually returns the text

and applies the above cleaning function

def get_text_list(data_path, toy):
    with open(data_path, "r", encoding="utf-8") as f:
        if not toy:
            return [clean_str(x.strip()) for x in f.readlines()][:200000]
        else:
            return [clean_str(x.strip()) for x in f.readlines()][:50]

This function will be called in multiple cases

if you need to load training data
or test data
or if you just need a sample of any of the above, by simply setting toy = True (the first 50 lines instead of the first 200000)
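One thing to keep in mind is that readlines() loads the whole file before the slice is applied. If memory becomes a concern, a variant along the following lines reads only the lines it needs. This is a sketch, not the code from the repo: the name get_text_list_lazy and the limit parameter are assumptions, and clean_str is repeated so the snippet runs on its own.

```python
import itertools
import os
import re
import tempfile

def clean_str(sentence):
    return re.sub("[#.]+", "#", sentence)

def get_text_list_lazy(data_path, toy, limit=200000):
    # Hypothetical variant: stop reading after `limit` lines (50 in toy mode)
    max_lines = 50 if toy else limit
    with open(data_path, "r", encoding="utf-8") as f:
        return [clean_str(line.strip()) for line in itertools.islice(f, max_lines)]

# Small self-check on a temporary file with 120 lines
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join("line %d ..." % i for i in range(120)))
    tmp_path = tmp.name

toy_lines = get_text_list_lazy(tmp_path, toy=True)
all_lines = get_text_list_lazy(tmp_path, toy=False)
os.remove(tmp_path)

print(len(toy_lines), len(all_lines), toy_lines[0])
```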

3-C Now let's build the function that actually creates the needed dictionary

Here you will see that we add 4 built-in words, which are essential for the seq2seq algorithm. They are

<padding> used to make the sequences the same length
<unk> used to mark a word that is not found inside the dict
<s> used to mark the beginning of a sentence
</s> used to mark the end of a sentence
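To make the role of these tokens concrete, here is a small sketch of how a headline might be prepared for a seq2seq decoder. The tiny dictionary and the way <s> and </s> are attached are assumptions for illustration; the actual model code comes in the next tutorial.

```python
word_dict = {"<padding>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
             "kwan": 4, "withdraws": 5, "from": 6, "championships": 7}
summary_max_len = 8

title = "kwan withdraws from us championships".split()

# Unknown words ("us" is not in this toy dict) fall back to <unk>
ids = [word_dict.get(w, word_dict["<unk>"]) for w in title]

# Assumed convention: the decoder input starts with <s>, the target ends with </s>
decoder_input = [word_dict["<s>"]] + ids
decoder_target = ids + [word_dict["</s>"]]

# <padding> fills both sequences up to the same fixed length
pad = word_dict["<padding>"]
decoder_input += [pad] * (summary_max_len - len(decoder_input))
decoder_target += [pad] * (summary_max_len - len(decoder_target))

print(decoder_input)
print(decoder_target)
```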

It is best to copy the code from GitHub, as the Medium editor may not preserve the indentation.

def build_dict(step, toy=False):
    if step == "train":
        #First lets load the training data
        train_article_list = get_text_list(train_article_path, toy)
        train_title_list = get_text_list(train_title_path, toy)

        #then lets collect all words from the training data
        #by simply tokenizing each text sample to its words
        #here we would use the built-in function imported from nltk toolkit
        #which simply returns a list of words from a sentence
        words = list()
        for sentence in train_article_list + train_title_list:
            for word in word_tokenize(sentence):
                words.append(word)

        #count the words and sort them from most to least common
        #(most_common() with no argument keeps all of them)
        word_counter = collections.Counter(words).most_common()

        #first lets set the 4 built-in words
        word_dict = dict()
        word_dict["<padding>"] = 0
        word_dict["<unk>"] = 1
        word_dict["<s>"] = 2
        word_dict["</s>"] = 3

        #then lets build our dict , by simply looping over the counted words
        for word, _ in word_counter:
            word_dict[word] = len(word_dict)

        #then lets save this to a pickle
        with open(default_path + "word_dict.pickle", "wb") as f:
            pickle.dump(word_dict, f)

    #all of the above was for the training step
    #when you are in the validation you can simply load the pickle that
    #you have just saved
    elif step == "valid":
        with open(default_path + "word_dict.pickle", "rb") as f:
            word_dict = pickle.load(f)

    #for both of the 2 cases (training , or validation)
    #we would create a reversed dict
    reversed_dict = dict(zip(word_dict.values(), word_dict.keys()))

    #then we would simply for the 2 cases (training , or validation)
    #define a max len for article and for the summary
    article_max_len = 50
    summary_max_len = 15
    return word_dict, reversed_dict, article_max_len, summary_max_len

4- Now let's build our dataset

After building the dict for our data, we begin building the actual dataset that will be used in our algorithm.

Using the above example of an article,

five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

The algorithm needs this to be represented as a list of integers,

which is simply the word_dict index of each word in the given sentence.

The same is done for the test data.
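The conversion can be sketched on a toy example as follows. The small word_dict and the max length of 8 are made up for the illustration, and whitespace splitting stands in for word_tokenize.

```python
word_dict = {"<padding>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
             "world": 4, "champion": 5, "michelle": 6, "kwan": 7, "withdrew": 8}
article_max_len = 8

def encode(sentence, word_dict, max_len):
    # Look up each word, falling back to <unk> for words outside the dict
    ids = [word_dict.get(w, word_dict["<unk>"]) for w in sentence.split()]
    # Cut to max_len, then pad shorter sequences with <padding>
    ids = ids[:max_len]
    return ids + [word_dict["<padding>"]] * (max_len - len(ids))

short_ids = encode("world champion michelle kwan withdrew", word_dict, article_max_len)
long_ids = encode("five-time world champion michelle kwan withdrew from the # us figure",
                  word_dict, article_max_len)

print(short_ids)
print(long_ids)
```

Every article ends up with exactly article_max_len indexes, whether it was originally shorter or longer.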

def build_dataset(step, word_dict, article_max_len, summary_max_len, toy=False):
    #---case of train
    #---we would load both (article , headline) for training
    if step == "train":
        article_list = get_text_list(train_article_path, toy)
        title_list = get_text_list(train_title_path, toy)
    #---case of valid
    #---we only load articles
    elif step == "valid":
        article_list = get_text_list(valid_article_path, toy)
    #---if step is neither (train nor valid) raise error
    else:
        raise NotImplementedError

    #---(for each article) get list of words
    #---so now x (article) contains list of words
    x = [word_tokenize(d) for d in article_list]

    #---(for each article) get index of word from word_dict for each article
    #---if not found , use "<unk>" token
    #---so now we have our train dataset
    x = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in x]

    #---(for each article) limit x to article_max_len
    x = [d[:article_max_len] for d in x]

    #---(for each article) if x was less than article_max_len
    #---pad the x by using "<padding>" token
    x = [d + (article_max_len - len(d)) * [word_dict["<padding>"]] for d in x]

    if step == "valid":
        return x
    else:
        #-------if step = "train"
        #-------we must do the same steps on headline
        #-------but here we don't use the concept of padding
        y = [word_tokenize(d) for d in title_list]
        y = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in y]
        y = [d[:(summary_max_len - 1)] for d in y]
        return x, y

So let's simply call both (build_dict and build_dataset)

print("Building dictionary...")
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)

print("Loading training dataset...")
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)

5- Word Embeddings

But we can't yet feed our neural network with a list containing the indexes of words, as it would not understand them.

We need to represent the word itself in a format that our neural net can understand, and here comes the concept of word embeddings.

It is a simple concept that replaces each word in your dict with a list of numbers (in our case we model each word with a list of 300 floats).
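In code, an embedding is just a lookup table: row i holds the vector for the word with index i. Here is a minimal sketch with a made-up table of 4 floats per word (instead of 300) filled with random values:

```python
import random

random.seed(0)
embedding_size = 4  # 300 in the tutorial; 4 keeps the example small
word_dict = {"<padding>": 0, "<unk>": 1, "kwan": 2, "withdrew": 3}

# One row of floats per word index (random here, pretrained in practice)
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_size)]
                   for _ in range(len(word_dict))]

sentence_ids = [word_dict["kwan"], word_dict["withdrew"], word_dict["<padding>"]]

# Replace each index with its vector: 3 words, 4 floats each
sentence_vectors = [embedding_table[i] for i in sentence_ids]

print(len(sentence_vectors), len(sentence_vectors[0]))
```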

There are already pretrained models that have been trained over millions of texts to correctly model the words. Once you are able to correctly model the words, your neural net can better capture the meaning of the text within the article.

A very well known test of how well the algorithm understands text after using word embeddings is applying word similarity to a given word: words with related meanings should come out as the most similar.

Such output tells us that the model is now capable of understanding the relations between words, which is an extremely important factor in the success of our neural net.
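The similarity test can be sketched with cosine similarity, a common measure of how close two embedding vectors are. The 3-dimensional vectors below are invented for the illustration; real GloVe vectors have many more dimensions (300 in our case).

```python
import math

# Made-up vectors: "king" and "queen" point in a similar direction, "banana" does not
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors):
    # Rank every other word by its cosine similarity to `word`
    others = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("king", vectors)
print(ranking)
```

With a gensim KeyedVectors model, the most_similar method plays a similar role on the real vectors.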

5-A Let's get the trained model for our work

There is a very well known set of pretrained vectors called GloVe, provided by Stanford. You can download it from https://nlp.stanford.edu/projects/glove/

Or you can simply copy it from my Google Drive as I explained before. Here is the link for the GloVe vectors in pickle format.

5-B Build a function to get an array of word embeddings

def get_init_embedding(reversed_dict, embedding_size):
    print("Loading Glove vectors...")

    #---Load glove model which is in a pickle format
    with open(default_path + "glove/model_glove_300.pkl", 'rb') as handle:
        word_vectors = pickle.load(handle)

    #---Loop through all words within the reversed_dict
    used_words = 0
    word_vec_list = list()
    for _, word in sorted(reversed_dict.items()):
        try:
            #-----------if the word is found in the glove model ,
            #-----------save its value
            word_vec = word_vectors.word_vec(word)
            used_words += 1
        except KeyError:
            #-----------else , generate an array of zeros
            #-----------of length = embedding_size
            #-----------which in this case would be 300
            #-----------this is the case also for <padding> and <unk>
            #-----------(<s> and </s> are given random vectors below)
            word_vec = np.zeros([embedding_size], dtype=np.float32)

        #-------add it to the array
        #-------remember that we are looping in sorted reversed_dict
        #-------so the index of the element inside word_vec_list
        #-------would be the same as index of word
        #-------no need of a dict , an array is sufficient
        word_vec_list.append(word_vec)

    #---just print out the percentage of known words
    print("words found in glove percentage = " + str((used_words/len(word_vec_list))*100))

    #----Assign random vector to <s>, </s> token
    word_vec_list[2] = np.random.normal(0, 1, embedding_size)
    word_vec_list[3] = np.random.normal(0, 1, embedding_size)

    #----then return the array
    return np.array(word_vec_list)

to call the function we simply call

word_embedding = get_init_embedding(reversed_dict, 300)
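To see what the function produces without loading the full GloVe file, here is a stripped-down sketch of the same idea. A plain dict of made-up 3-dimensional vectors stands in for the GloVe model, plain lists stand in for numpy arrays, and build_toy_embedding is a hypothetical name used only here.

```python
import random

random.seed(1)

reversed_dict = {0: "<padding>", 1: "<unk>", 2: "<s>", 3: "</s>",
                 4: "kwan", 5: "olympics", 6: "zzyzx"}
# Stand-in for the pretrained GloVe vectors
toy_vectors = {"kwan": [0.1, 0.2, 0.3], "olympics": [0.4, 0.5, 0.6]}

def build_toy_embedding(reversed_dict, vectors, embedding_size):
    rows, found = [], 0
    # Sorted by index, so row i belongs to the word with index i
    for _, word in sorted(reversed_dict.items()):
        if word in vectors:
            rows.append(list(vectors[word]))
            found += 1
        else:
            # Words without a pretrained vector start as zeros
            rows.append([0.0] * embedding_size)
    # <s> and </s> (indexes 2 and 3) get random vectors instead of zeros
    for index in (2, 3):
        rows[index] = [random.gauss(0, 1) for _ in range(embedding_size)]
    return rows, found / len(rows)

embedding, coverage = build_toy_embedding(reversed_dict, toy_vectors, 3)
print(len(embedding), len(embedding[0]), round(coverage, 2))
```

The result has one row per word in the dict and one column per embedding dimension, which is the shape the real function returns as a numpy array.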

To sum it all up

We can say that we have now correctly represented the text for our task of text summarization.

So to sum it all up, we have built the code to build the word dictionary, build the dataset of word indexes, and load the word embeddings,

by simply calling

word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)

train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)

word_embedding = get_init_embedding(reversed_dict, 300)

In the coming tutorial, if GOD wills it, we will go through how to build the model itself. We will build a seq2seq encoder-decoder model using LSTM and go through the details of building such a model using TensorFlow. This will be the cornerstone for the next tutorials in the series, which will go through the latest approaches to this problem, from

using pointer generator model
using reinforcement learning with deep learning

Don't forget to clone the code for this tutorial from my repo.

You can also take a look at the previous tutorial, an overview of text summarization.

You can also check this blog about the ecosystem of a free deep learning platform.

I truly hope you have enjoyed this tutorial. I am waiting for your feedback, and I am waiting for you in the next tutorial, if GOD wills it.