This story is a continuation of the series on how to easily build an abstractive text summarizer (check out the [github repo](https://github.com/theamrzaki/text_summurization_abstractive_methods) for this series). Today we will go through how to build **a summarizer able to understand words**, by learning how to represent words to our summarizer.

My goal in this series is to present the latest novel approaches to abstractive text summarization in a simple way (you can check [my overview blog](https://hackernoon.com/text-summarizer-using-deep-learning-made-easy-490880df6cd)), going from

1. the cornerstone method of using seq2seq models with attention,
2. to using a pointer generator,
3. to using reinforcement learning with deep learning.

We will use Google Colab, so you won't need a powerful computer, nor will you have to download data to your device, as we will connect Google Drive to Google Colab to have a fully integrated deep learning experience (you can check [my overview on working on free deep learning ecosystem platforms](https://hackernoon.com/begin-your-deep-learning-project-for-free-free-gpu-processing-free-storage-free-easy-upload-b4dba18abebc)).

All code can be found online through [my github repo](https://github.com/theamrzaki/text_summurization_abstractive_methods).

> This tutorial is based on the work of [https://github.com/dongjun-Lee/text-summarization-tensorflow](https://github.com/dongjun-Lee/text-summarization-tensorflow). They have truly done great work on simplifying what is needed to apply summarization using TensorFlow; I have built on their code to convert it into a Python notebook that works on Google Colab, and I truly admire their work.

So let's begin!

### 1- Setup

#### 1-A To begin, we first create a Google Colab notebook

1. Go to [https://colab.research.google.com](https://colab.research.google.com)
2. Select the Google Drive tab (to save your new notebook to Google Drive)
3. Select **New Python 3 Notebook** (you can also select a Python 2 notebook)

A blank notebook will be created in your Google Drive. You can change the runtime of your notebook from the Runtime button in the top menu, to

1. change which Python version you are using
2. choose a hardware accelerator (GPU, TPU)

#### 1-A-A Or you can clone the code directly from my github repo

1. Go to [https://colab.research.google.com](https://colab.research.google.com), but this time select the GitHub tab
2. Then just paste this [link](https://github.com/theamrzaki/text_summurization_abstractive_methods/blob/master/Implementation%20A%20%28seq2seq%20with%20attention%20and%20feature%20rich%20representation%29/Model%202/Model_2.ipynb) and click upload
#### 1-B Now that we have created our Google Colab notebook, let's connect it to Google Drive

In the newly created notebook, add a new code cell, then paste this code into it:

```
# https://stackoverflow.com/questions/47744131/colaboratory-can-i-access-to-my-google-drive-folder-and-file
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse

from google.colab import auth
auth.authenticate_user()

from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()

import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive
```

This connects to your drive and creates a folder through which your notebook can access your Google Drive. It will ask you for access to your drive: just click on the link and copy the access token (it asks for this twice).

After writing this code, you run the cell with Shift+Enter or by clicking the play button at the top of the code cell. You can then access any file by its path, in the form

```
path = "drive/test.txt"
```

#### 1-C Now let's get the data that we will work on

Our dataset consists of **news** articles and their **headlines**. The input is the **news content**, and the required output is its summary, which in this case is the **headline**.

There are 2 popular datasets for this task:

1. Amazon Product Reviews
2. CNN / Daily Mail dataset **(which we will use in our case)**

You **don't have to download** the data; you can just **copy it to your Google Drive**, which only takes a few seconds. Here is the [link](https://drive.google.com/drive/folders/1Izsbg_p1s52dFNh8NmSG5jmDtRgHcLUN?usp=sharing) to the folder containing the data.

Here we use [Copy, URL to Google Drive](https://softgateon.herokuapp.com/urltodrive/ "Go Home"), which enables you to easily copy files between different Google Drives:

1. Paste the above [link](https://drive.google.com/drive/folders/1Izsbg_p1s52dFNh8NmSG5jmDtRgHcLUN?usp=sharing), name the files, then choose save to Google Drive
2. After authenticating your Google Drive, just click "Save, Copy to Google Drive"

Now that the setup is done, we can start our work, so let's begin!
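Before moving on, it can help to confirm that the mount and the copy actually worked. The short check below is not part of the original notebook; it is a minimal sketch that simply lists the mounted drive and the dataset folder, assuming you keep the `Model 2/sumdata/train` layout that the paths in section 2-C use:

```
# Optional sanity check (not in the original tutorial): confirm the Drive mount
# worked and the copied dataset is visible. The folder layout is an assumption
# that matches the paths defined later in section 2-C; adjust it if yours differs.
import os

default_path = "drive/Colab Notebooks/Model 2/"
train_folder = os.path.join(default_path, "sumdata/train")

print(os.listdir("drive")[:10])         # top level of your mounted Google Drive
if os.path.isdir(train_folder):
    print(os.listdir(train_folder))     # expect train.article.txt, train.title.txt, ...
else:
    print("Data not found yet - check that the copy to Drive has finished.")
```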
### 2- Dependencies and paths

#### 2-A First let's install the needed dependencies

In Google Colab you can install packages with pip by simply prefixing the command with `!pip`. In every code cell, you just click inside it and start writing your code:

```
!pip install gensim
!pip install wget

import nltk
nltk.download('punkt')
```

#### 2-B Then let's import the needed dependencies

```
from nltk.tokenize import word_tokenize
import re
import collections
import pickle
import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
```

#### 2-C Then let's define where the data can be found

```
# default path for the folder inside google drive
default_path = "drive/Colab Notebooks/Model 2/"

# path for training text (article)
train_article_path = default_path + "sumdata/train/train.article.txt"
# path for training text output (headline)
train_title_path = default_path + "sumdata/train/train.title.txt"
# path for validation text (article)
valid_article_path = default_path + "sumdata/train/valid.article.filter.txt"
# path for validation text output (headline)
valid_title_path = default_path + "sumdata/train/valid.title.filter.txt"
```

### 3- Building a Dictionary

For text summarization to work, you must represent your words in a dictionary format.

Assume we have an article like

> five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

Each word gets its own index in a dict, and we also need the reverse mapping (from index back to word). To apply this we need some helper functions.

#### 3-A A simple data cleaning function

The goal of this function is a simple cleaning of the data, just by replacing some unneeded characters with #:

```
def clean_str(sentence):
    sentence = re.sub("[#.]+", "#", sentence)
    return sentence
```

This substitution is rather simple; you can of course add multiple substitution steps.

#### 3-B A function that returns the text and applies the above cleaning function

```
def get_text_list(data_path, toy):
    with open(data_path, "r", encoding="utf-8") as f:
        if not toy:
            return [clean_str(x.strip()) for x in f.readlines()][:200000]
        else:
            return [clean_str(x.strip()) for x in f.readlines()][:50]
```

This function is called in multiple cases:

1. if you need to load the training data,
2. or the test data,
3. or if you just need a small sample of either of them, by simply setting **toy = True**.
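To get a feel for what these helpers return, here is a quick illustrative check. It is not part of the original code; it just loads a tiny toy sample using the paths from section 2-C and prints the first cleaned article/headline pair:

```
# Illustrative only (not in the original notebook): peek at a toy sample.
sample_articles = get_text_list(train_article_path, toy=True)   # only the first 50 lines
sample_titles = get_text_list(train_title_path, toy=True)

print(len(sample_articles))   # -> 50
print(sample_articles[0])     # a cleaned article: runs of '#' and '.' collapsed into '#'
print(sample_titles[0])       # its headline
```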
#### 3-C Now let's build the function that actually creates the needed dictionary

Here you will see that we add 4 built-in tokens. These are essential for the seq2seq algorithm:

1. `<padding>`: used to make the sequences the same length
2. `<unk>`: used to mark a word that is not found inside the dict
3. `<s>`: used to mark the beginning of a sentence
4. `</s>`: used to mark the end of a sentence

> Copy the code [from github](https://github.com/theamrzaki/text_summurization_abstractive_methods/blob/master/Implementation%20A%20%28seq2seq%20with%20attention%20and%20feature%20rich%20representation%29/Model%202/Model_2.ipynb), as the indentation here may be incorrect due to the Medium editor.

```
def build_dict(step, toy=False):
    if step == "train":
        # first let's load the training data
        train_article_list = get_text_list(train_article_path, toy)
        train_title_list = get_text_list(train_title_path, toy)

        # then let's collect all words from the training data
        # by simply tokenizing each text sample into its words.
        # here we use word_tokenize imported from the nltk toolkit,
        # which simply returns a list of words from a sentence
        words = list()
        for sentence in train_article_list + train_title_list:
            for word in word_tokenize(sentence):
                words.append(word)

        # we only keep the most common words
        word_counter = collections.Counter(words).most_common()

        # first let's set the 4 built-in tokens
        word_dict = dict()
        word_dict["<padding>"] = 0
        word_dict["<unk>"] = 1
        word_dict["<s>"] = 2
        word_dict["</s>"] = 3

        # then let's build our dict, by simply looping over word_counter
        for word, _ in word_counter:
            word_dict[word] = len(word_dict)

        # then let's save this to a pickle
        with open(default_path + "word_dict.pickle", "wb") as f:
            pickle.dump(word_dict, f)

    # all of the above was for the training step.
    # in the validation step you can simply load the pickle
    # that you have just saved
    elif step == "valid":
        with open(default_path + "word_dict.pickle", "rb") as f:
            word_dict = pickle.load(f)

    # for both cases (training or validation)
    # we create a reversed dict
    reversed_dict = dict(zip(word_dict.values(), word_dict.keys()))

    # then, also for both cases (training or validation),
    # we define a max length for the article and for the summary
    article_max_len = 50
    summary_max_len = 15

    return word_dict, reversed_dict, article_max_len, summary_max_len
```
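Here is a small round-trip illustration (not part of the original code) of what the two dicts give you once `build_dict` has run: each word maps to an index, the reversed dict maps it back, and unseen words fall back to the `<unk>` id:

```
# Illustrative only: how word_dict / reversed_dict are meant to be used.
# Running build_dict with toy=True also overwrites word_dict.pickle with the
# small toy dict, so rerun it with toy=False before real training.
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", toy=True)

sentence = "five-time world champion michelle kwan withdrew"
ids = [word_dict.get(w, word_dict["<unk>"]) for w in word_tokenize(sentence)]
print(ids)                              # the actual ids depend on the data
print([reversed_dict[i] for i in ids])  # maps the ids back to the words
print(word_dict.get("some-unseen-word", word_dict["<unk>"]))   # -> 1, the <unk> id
```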
### 4- Now let's build our dataset

After building the dict for our data, we can build the actual dataset that will be used by our algorithm. Using the above example of an article,

> five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

the algorithm needs this to be represented as a list of indices, which is simply the word-dict entry for each word in the given sentence. The same happens for the test data.

```
def build_dataset(step, word_dict, article_max_len, summary_max_len, toy=False):
    # --- case of train:
    # --- we load both (article, headline) for training
    if step == "train":
        article_list = get_text_list(train_article_path, toy)
        title_list = get_text_list(train_title_path, toy)
    # --- case of valid:
    # --- we only load articles
    elif step == "valid":
        article_list = get_text_list(valid_article_path, toy)
    # --- if step is neither train nor valid, raise an error
    else:
        raise NotImplementedError

    # --- (for each article) get the list of words,
    # --- so now x (article) contains a list of words
    x = [word_tokenize(d) for d in article_list]

    # --- (for each article) get the index of each word from word_dict;
    # --- if a word is not found, use the "<unk>" token,
    # --- so now we have our train dataset
    x = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in x]

    # --- (for each article) limit x to article_max_len
    x = [d[:article_max_len] for d in x]

    # --- (for each article) if x is shorter than article_max_len,
    # --- pad x using the "<padding>" token
    x = [d + (article_max_len - len(d)) * [word_dict["<padding>"]] for d in x]

    if step == "valid":
        return x
    else:
        # ------- if step == "train",
        # ------- we do the same steps on the headlines,
        # ------- but here we don't apply padding
        y = [word_tokenize(d) for d in title_list]
        y = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in y]
        y = [d[:(summary_max_len - 1)] for d in y]
        return x, y
```

So let's simply call both (build_dict and build_dataset):

```
print("Building dictionary...")
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)
print("Loading training dataset...")
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)
```

### 5- Word Embeddings

But we can't yet feed our neural network a list containing the indexes of words, as it would not understand them. We need to represent each word itself in a format that our neural net can understand, and here comes the concept of word embeddings.

It is a simple concept: replace each word in your dict with a list of numbers (in our case we model each word with a list of 300 floats).

There are models that have already been trained over millions of texts to model words correctly; once the words are modeled correctly, your neural net becomes able to truly understand the text within the article. A well-known test of how well an algorithm understands text after using word embeddings is word similarity: given a word, find the words whose vectors are closest to it. Such a test shows that the model is capable of understanding the relations between words, which is an extremely important factor in the success of our neural net.

#### 5-A Let's get the trained model for our work

There is a very well known set of pretrained vectors called [GloVe pre-trained vectors](https://nlp.stanford.edu/projects/glove/) provided by Stanford. You can download it from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/), or you can simply copy it to your Google Drive as I have explained before; here is the [link](https://drive.google.com/drive/folders/1qxBKLczcqA5Y682SpZhWX6Z_COrNjMDj?usp=sharing) for the GloVe vectors in a pickle format.

#### 5-B Build a function to get an array of word embeddings

```
def get_init_embedding(reversed_dict, embedding_size):
    print("Loading Glove vectors...")

    # --- load the glove model, which is in a pickle format
    with open(default_path + "glove/model_glove_300.pkl", 'rb') as handle:
        word_vectors = pickle.load(handle)

    # --- loop through all words within reversed_dict
    used_words = 0
    word_vec_list = list()
    for _, word in sorted(reversed_dict.items()):
        try:
            # ----------- if the word is found in the model,
            # ----------- save its vector
            word_vec = word_vectors.word_vec(word)
            used_words += 1
        except KeyError:
            # ----------- else, generate an array of zeros
            # ----------- of length = embedding_size, which in this case is 300.
            # ----------- this is also the case for <padding> and <unk>,
            # ----------- while the <s> and </s> tokens get random vectors,
            # ----------- as seen below
            word_vec = np.zeros([embedding_size], dtype=np.float32)

        # ------- add it to the array.
        # ------- remember that we are looping over the sorted reversed_dict,
        # ------- so the index of the element inside word_vec_list
        # ------- is the same as the index of the word;
        # ------- no need for a dict, an array is sufficient
        word_vec_list.append(word_vec)

    # --- just print out the percentage of known words
    print("words found in glove percentage = " + str((used_words / len(word_vec_list)) * 100))

    # ---- assign random vectors to the <s> and </s> tokens
    word_vec_list[2] = np.random.normal(0, 1, embedding_size)
    word_vec_list[3] = np.random.normal(0, 1, embedding_size)

    # ---- then return the array
    return np.array(word_vec_list)
```

To call the function we simply write:

```
word_embedding = get_init_embedding(reversed_dict, 300)
```
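To see that the embeddings really encode word relations, you can run a quick word-similarity query and check the shape of the returned matrix. This snippet is a sketch that is not part of the original notebook; it assumes the pickle holds a gensim KeyedVectors object (which is what the `word_vec()` call above implies) and that a common word such as "king" is in its vocabulary:

```
# Illustrative check (not in the original notebook): word similarity on the
# loaded GloVe vectors, plus a shape check on the embedding matrix.
with open(default_path + "glove/model_glove_300.pkl", 'rb') as handle:
    word_vectors = pickle.load(handle)

# words with the closest vectors to "king" -- the word-similarity test
# described at the start of section 5
print(word_vectors.most_similar("king", topn=5))

# one row of 300 floats per entry in word_dict, in the same order as reversed_dict
print(word_embedding.shape)             # -> (len(word_dict), 300)
print(np.all(word_embedding[0] == 0))   # -> True, the <padding> row is all zeros
```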
### To sum it all up

We have now correctly represented the text for our task of text summarization. We have built the code to create the dictionary, build the dataset, and load the word embeddings, by simply calling

```
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)
word_embedding = get_init_embedding(reversed_dict, 300)
```

In the coming tutorial, if GOD wills it, we will go through how to build the model itself: a seq2seq encoder-decoder model using LSTMs. We will go through the very details of building such a model using TensorFlow. This will be the cornerstone for the next tutorials in the series, which will cover the latest approaches to this problem:

1. using a pointer generator model
2. using reinforcement learning with deep learning

Don't forget to clone the code for this tutorial from my [repo](https://github.com/theamrzaki/text_summurization_abstractive_methods).

You can also take a look at the [previous tutorial](https://medium.com/@theamrzaki/text-summarizer-using-deep-learning-made-easy-490880df6cd), which gives an overview of text summarization, and check this [blog](https://hackernoon.com/begin-your-deep-learning-project-for-free-free-gpu-processing-free-storage-free-easy-upload-b4dba18abebc) about the ecosystem of free deep learning platforms.

I truly hope you have enjoyed this tutorial. I am waiting for your feedback, and I am waiting for you in the next tutorial, if GOD wills it.