This story is a continuation of the series on how to easily build an abstractive text summarizer (check out the github repo for this series). Today we go through how you would be able to build a summarizer able to understand words, so we would go through representing words to our summarizer.

My goal in this series is to present the latest novel ways of abstractive text summarization in a simple way (you can check my overview blog), from the cornerstone method of using seq2seq models with attention, to using pointer generator, to using reinforcement learning with deep learning.

We would use google colab, so you won't have to use a powerful computer, nor would you have to download data to your device, as we would connect google drive to google colab to have a fully integrated deep learning experience (you can check my overview on working on free deep learning ecosystem platforms).

All code can be found online through my github repo.

This tutorial is based on the work at https://github.com/dongjun-Lee/text-summarization-tensorflow . They have truly done great work on simplifying what is needed to apply summarization using tensorflow. I have built on their code to convert it to a python notebook that works on google colab. I truly admire their work.

So lets begin !!
1- Setup

1-A To begin, we first create a google colab notebook

1- go to https://colab.research.google.com
2- select the Google Drive tab (to save your new google colab to google drive)
3- select New Python 3 Notebook (you can also select python 2 notebook)

A blank notebook would be created in your google drive.

You can change the runtime of your notebook by selecting the runtime button in the top menu. From there you can change which python version you are using and choose a hardware accelerator (GPU, TPU).

1-A-A Or you can clone the code directly from my github repo

Go to https://colab.research.google.com , but this time select the github tab. Then just paste the link to the repo and click upload.

1-B Now that we have created our google colab, lets connect to google drive

In the newly created notebook, add a new code cell, then paste this code in it

    # https://stackoverflow.com/questions/47744131/colaboratory-can-i-access-to-my-google-drive-folder-and-file
    !apt-get install -y -qq software-properties-common python-software-properties module-init-tools
    !add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
    !apt-get update -qq 2>&1 > /dev/null
    !apt-get -y install -qq google-drive-ocamlfuse fuse
    from google.colab import auth
    auth.authenticate_user()
    from oauth2client.client import GoogleCredentials
    creds = GoogleCredentials.get_application_default()
    import getpass
    !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
    vcode = getpass.getpass()
    !echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
    !mkdir -p drive
    !google-drive-ocamlfuse drive

This would connect to your drive and create a folder named drive through which your notebook can access your google drive.

It would ask you for access to your drive: just click on the link and copy the access token. It would ask for this twice.

After writing this code, you run it by clicking on the cell and pressing shift + enter, or by clicking the play button at the top of your code cell.

Then you can simply access any file by its path, in the form of

    path = "drive/test.txt"

1-C Now lets get the data that we would work on

The data set that we would work on is in the form of news articles and their headlines. The input would be the news content, and the output needed would be its summary, which in this case is the headline.

There are 2 popular datasets for this task

Amazon Product Reviews
CNN / Daily news dataset (which we would use in our case)

You don't have to download the data, you can just copy it to your google drive. It would take some seconds, not more.

Here is the link for the folder containing the data.

Here we would use Copy, URL to Google Drive, which enables you to easily copy files between different google drives.

First you would paste the above link, name it, and then click on Save, Copy to Google Drive (after authenticating your google drive). After authenticating, you just click save to google drive.

Now that the setup process is done, we can start our work, so lets begin !!
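Before moving on, it can help to confirm that the copied files landed where the notebook expects them. The following is only a minimal sketch and is not part of the original repo: `find_missing_files` is a hypothetical helper, and the base folder and file names are assumptions taken from the paths defined in the next section, so adjust them to your own drive.

```python
import os

def find_missing_files(base_dir, relative_paths):
    """Return the relative paths that do not exist under base_dir."""
    missing = []
    for rel in relative_paths:
        if not os.path.isfile(os.path.join(base_dir, rel)):
            missing.append(rel)
    return missing

# Assumed layout: adjust both values to match your own Google Drive folder.
expected_files = [
    "sumdata/train/train.article.txt",
    "sumdata/train/train.title.txt",
]
base_dir = "drive/Colab Notebooks/Model 2/"

print("missing files:", find_missing_files(base_dir, expected_files))
```

An empty list means every expected file was found. Anything listed was not copied, or sits in a different folder than the one in `base_dir`.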
2- Dependencies and paths

2-a First lets install the needed dependencies

In google colab you are able to install packages using pip, by simply writing !pip. For every code section, you simply add a new code cell and then start writing your code.

    !pip install gensim
    !pip install wget

    import nltk
    nltk.download('punkt')

2-b Then lets import the needed dependencies

    from nltk.tokenize import word_tokenize
    import re
    import collections
    import pickle
    import numpy as np
    from gensim.models.keyedvectors import KeyedVectors
    from gensim.test.utils import get_tmpfile
    from gensim.scripts.glove2word2vec import glove2word2vec

2-c Then lets define where the data can be found

    # default path for the folder inside google drive
    default_path = "drive/Colab Notebooks/Model 2/"

    # path for training text (article)
    train_article_path = default_path + "sumdata/train/train.article.txt"

    # path for training text output (headline)
    train_title_path = default_path + "sumdata/train/train.title.txt"

    # path for validation text (article)
    valid_article_path = default_path + "sumdata/train/valid.article.filter.txt"

    # path for validation text output (headline)
    valid_title_path = default_path + "sumdata/train/valid.title.filter.txt"

3- Building A Dictionary

For the text summarization to work, you must represent your words in a dictionary format.

Assume we have an article like

five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

Each word would have a representation in a dict (the word mapped to an integer index), and we would also need the reverse operation (the index mapped back to its word).

To apply this we would need some helper functions.

3-A Simple cleaning data function

The goal of this function is a simple cleaning of the data, just by replacing some unneeded characters (runs of # and .) with a single #

    def clean_str(sentence):
        sentence = re.sub("[#.]+", "#", sentence)
        return sentence

This substitution of characters is rather simple. You can of course add multiple substitution steps.

3-B Function that actually returns the text and applies the above cleaning function

    def get_text_list(data_path, toy):
        with open(data_path, "r", encoding="utf-8") as f:
            if not toy:
                return [clean_str(x.strip()) for x in f.readlines()][:200000]
            else:
                return [clean_str(x.strip()) for x in f.readlines()][:50]

This function would be called in multiple cases: if you need to load training data or test data, or if you just need a small sample (the first 50 lines) of any of the above, by simply setting toy = True.

3-C Now lets build the function that would actually create the needed dictionary

Here you would see that we add 4 built-in words. These are essential for the seq2seq algorithm. They are

<padding> this would be used to make the sequences the same length
<unk> this would be used to identify that the word is not found inside the dict
<s> this would be used to identify the beginning of a sentence
</s> this would be used to identify the end of a sentence

Copy the code from github, as the indentation here may be incorrect due to the editor of medium.

    def build_dict(step, toy=False):
        if step == "train":
            # first lets load the training data
            train_article_list = get_text_list(train_article_path, toy)
            train_title_list = get_text_list(train_title_path, toy)

            # then lets collect all words from the training data
            # by simply tokenizing each text sample to its words
            # here we use word_tokenize, imported from the nltk toolkit,
            # which simply returns a list of words from a sentence
            words = list()
            for sentence in train_article_list + train_title_list:
                for word in word_tokenize(sentence):
                    words.append(word)

            # sort the words by frequency, most common first
            # (most_common() with no argument keeps every word)
            word_counter = collections.Counter(words).most_common()

            # first lets set the 4 built-in words
            word_dict = dict()
            word_dict["<padding>"] = 0
            word_dict["<unk>"] = 1
            word_dict["<s>"] = 2
            word_dict["</s>"] = 3

            # then lets build our dict, by simply looping over word_counter
            for word, _ in word_counter:
                word_dict[word] = len(word_dict)

            # then lets save this to a pickle
            with open(default_path + "word_dict.pickle", "wb") as f:
                pickle.dump(word_dict, f)

        # all of the above was for the training step
        # when you are in the validation step you can simply load
        # the pickle that you have just saved
        elif step == "valid":
            with open(default_path + "word_dict.pickle", "rb") as f:
                word_dict = pickle.load(f)

        # for both cases (training or validation)
        # we create a reversed dict (index -> word)
        reversed_dict = dict(zip(word_dict.values(), word_dict.keys()))

        # then for both cases (training or validation)
        # we define a max len for the article and for the summary
        article_max_len = 50
        summary_max_len = 15

        return word_dict, reversed_dict, article_max_len, summary_max_len

4- Now Lets Build Our Dataset

After building the dict for our data, we would begin to build the actual dataset that would be used in our algorithm.

Using the above example of an article,

five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #

the algorithm would need this to be represented as a list of indexes, which is simply the word_dict index of each word in the given sentence. The same would occur on the test data.

    def build_dataset(step, word_dict, article_max_len, summary_max_len, toy=False):
        # ---case of train
        # ---we load both (article, headline) for training
        if step == "train":
            article_list = get_text_list(train_article_path, toy)
            title_list = get_text_list(train_title_path, toy)
        # ---case of valid
        # ---we only load articles
        elif step == "valid":
            article_list = get_text_list(valid_article_path, toy)
        # ---if step is neither (train nor valid) raise an error
        else:
            raise NotImplementedError

        # ---(for each article) get the list of words
        # ---so now x (article) contains lists of words
        x = [word_tokenize(d) for d in article_list]

        # ---(for each article) get the index of each word from word_dict
        # ---if not found, use the "<unk>" token
        x = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in x]

        # ---(for each article) limit x to article_max_len
        x = [d[:article_max_len] for d in x]

        # ---(for each article) if x is shorter than article_max_len
        # ---pad it using the "<padding>" token
        x = [d + (article_max_len - len(d)) * [word_dict["<padding>"]] for d in x]

        if step == "valid":
            return x
        else:
            # ---if step == "train"
            # ---we must do the same steps on the headline
            # ---but here we don't use the concept of padding
            y = [word_tokenize(d) for d in title_list]
            y = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in y]
            y = [d[:(summary_max_len - 1)] for d in y]
            return x, y

So lets simply call both (build dict and build dataset)

    print("Building dictionary...")
    word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)

    print("Loading training dataset...")
    train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)

5- Word Embeddings

But we can't yet feed our neural network with a list containing the indexes of words, as it would not understand them. We need to represent the word itself in a format that our neural net would understand, and here comes the concept of word embeddings.

It is a simple concept that replaces each word in your dict with a list of numbers (in our case we would model each word with a list of 300 floats).

There are already trained models that have been trained over millions of texts to correctly model the words. Once you are able to correctly model the words, your neural net would be able to truly understand the text within the article.
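To make the idea concrete, here is a toy sketch of what an embedding lookup does with the indexed, padded sequences built in the previous step. It is not the tutorial's code: the tiny vocabulary, the 4-dimensional vectors (instead of 300), the `encode` helper, and the random values standing in for trained vectors are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary with the same 4 built-in tokens as word_dict.
toy_dict = {"<padding>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
            "world": 4, "champion": 5}
max_len = 5
embedding_size = 4  # the tutorial uses 300

def encode(words, word_dict, max_len):
    """Map words to indexes, truncate to max_len, then pad on the right."""
    ids = [word_dict.get(w, word_dict["<unk>"]) for w in words]
    ids = ids[:max_len]
    return ids + (max_len - len(ids)) * [word_dict["<padding>"]]

# One row per word index; random values stand in for trained vectors.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(0, 1, (len(toy_dict), embedding_size))

# "kwan" is not in the toy vocabulary, so it maps to <unk> (index 1).
ids = encode(["world", "champion", "kwan"], toy_dict, max_len)

# Indexing the matrix with a list of ids returns one vector per position.
vectors = toy_embeddings[ids]

print(ids)            # [4, 5, 1, 0, 0]
print(vectors.shape)  # (5, 4)
```

So a sentence of word indexes becomes a small matrix with one row of floats per word, and that matrix is what the neural net actually receives.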
A very well known test to identify how well the algorithm understands text after using word embeddings is applying word similarity on a given word. The output of such a test tells us whether the model is capable of understanding the relations between words, which is an extremely important factor in the success of our neural net.

5-A Lets get the trained model for our work

There is a very well known pretrained model called Glove pre-trained vectors, provided by stanford. You can download it from https://nlp.stanford.edu/projects/glove/

Or you can simply copy it from my google drive like I have explained before. Here is the link for the glove vectors in a pickle format.

5-B Build a function to get an array of word embeddings

    def get_init_embedding(reversed_dict, embedding_size):
        print("Loading Glove vectors...")

        # ---load the glove model, which is in a pickle format
        with open(default_path + "glove/model_glove_300.pkl", 'rb') as handle:
            word_vectors = pickle.load(handle)

        # ---loop through all words within the reversed_dict
        used_words = 0
        word_vec_list = list()
        for _, word in sorted(reversed_dict.items()):
            # ---if the word is found in the glove model, save its vector
            # ---else, generate an array of zeros of length = embedding_size
            # ---which in this case would be 300
            # ---this is the case for <padding> and <unk>
            # ---(<s> and </s> get random vectors further below)
            try:
                word_vec = word_vectors.word_vec(word)
                used_words += 1
            except KeyError:
                word_vec = np.zeros([embedding_size], dtype=np.float32)

            # ---add it to the array
            # ---remember that we are looping over the sorted reversed_dict
            # ---so the index of the element inside word_vec_list
            # ---would be the same as the index of the word
            # ---no need for a dict, an array is sufficient
            word_vec_list.append(word_vec)

        # ---just print out the percentage of known words
        print("words found in glove percentage = " +
              str((used_words / len(word_vec_list)) * 100))

        # ---assign a random vector to the <s> and </s> tokens
        word_vec_list[2] = np.random.normal(0, 1, embedding_size)
        word_vec_list[3] = np.random.normal(0, 1, embedding_size)

        # ---then return the array
        return np.array(word_vec_list)

To call the function we simply write

    word_embedding = get_init_embedding(reversed_dict, 300)

To sum it all up

We can say that we have now correctly represented the text for our task of text summarization. We have built the code to create the dictionary, the dataset, and the word embeddings by simply calling

    word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)

    train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)

    word_embedding = get_init_embedding(reversed_dict, 300)

The coming steps

In the coming tutorial, if GOD wills it, we would go through how to build the model itself. We would build a seq2seq encoder decoder model using LSTM, and we would go through the very details of building such a model using tensorflow. This would be the cornerstone for the next tutorials in the series, which would go through the latest approaches for this problem, from using a pointer generator model to using reinforcement learning with deep learning.

Don't forget to clone the code for this tutorial from my repo. You can also take a look at the previous tutorial, which gives an overview of text summarization, and you can check this blog about the ecosystem of a free deep learning platform.

I truly hope you have enjoyed this tutorial. I am waiting for your feedback, and I am waiting for you in the next tutorial, if GOD wills it.