This story continues the series on how to easily build an abstractive text summarizer (check out the GitHub repo for this series). Today we go through how to build a summarizer that is able to understand words, so we will cover how to represent words for our summarizer.
My goal in this series is to present the latest approaches to abstractive text summarization in a simple way (you can check my overview blog).
We will use Google Colab, so you won't need a powerful computer, nor will you have to download data to your device, as we will connect Google Drive to Google Colab to have a fully integrated deep learning experience (you can check my overview on working with free deep learning ecosystem platforms).
All code can be found online in my GitHub repo.
This tutorial is based on the work of https://github.com/dongjun-Lee/text-summarization-tensorflow . They have done great work simplifying what is needed to apply summarization using TensorFlow. I have built on their code and converted it to a Python notebook that works on Google Colab. I truly admire their work.
So let's begin!
1- go to https://colab.research.google.com
2- select the Google Drive tab (to save your new Colab notebook to Google Drive)
3- select New Python 3 Notebook (you can also select a Python 2 notebook)
A blank notebook will be created in your Google Drive.
You can change the runtime of your notebook by selecting the Runtime button in the top menu.
In the newly created notebook, add a new code cell,
then paste this code into it:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive
This connects to your Drive and creates a folder through which your notebook can access your Google Drive.
It will ask you for access to your Drive: just click on the link and copy the access token. It will ask for this twice.
After pasting this code, run it by selecting the cell and pressing Shift+Enter, or by clicking the play button at the top of your code cell.
Then you can simply access any file by its path, in the form of
path = "drive/test.txt"
The dataset that we will work on consists of news articles and their headlines.
The input is the news content, and the required output is its summary, which in this case is the headline.
There are 2 popular datasets for this task
You don't have to download the data. You can just copy it to your Google Drive, which takes no more than a few seconds.
Here is the link for the folder containing the data.
Here we will use Copy, URL to Google Drive, which enables you to easily copy files between different Google Drives.
First, paste the above link, name it, and choose to save it to Google Drive.
Then click on Save, Copy to Google Drive (after authenticating your Google Drive).
After authenticating, just click Save to Google Drive.
Now that the setup process is done, we can start our work. So let's begin!
In Google Colab you can install packages with pip by simply prefixing the command with an exclamation mark: !pip
For every code section, add a new code cell
and then just start writing your code.
!pip install gensim
!pip install wget

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
import re
import collections
import pickle
import numpy as np
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

# default path for the folder inside Google Drive
default_path = "drive/Colab Notebooks/Model 2/"

# path for training text (article)
train_article_path = default_path + "sumdata/train/train.article.txt"

# path for training text output (headline)
train_title_path = default_path + "sumdata/train/train.title.txt"

# path for validation text (article)
valid_article_path = default_path + "sumdata/train/valid.article.filter.txt"

# path for validation text output (headline)
valid_title_path = default_path + "sumdata/train/valid.title.filter.txt"
For text summarization to work, you must represent your words in a dictionary format.
Assume we have an article like
five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #
Each word would have a representation (an index) in a dict,
and we would also need the reverse operation, from an index back to its word.
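To make the two directions concrete, here is a minimal sketch on a toy vocabulary. The words and indexes below are made up for illustration (they are not the real dictionary built later); only the four special tokens and the fallback to "<unk>" follow the code in this tutorial.

```python
# Toy vocabulary: hand-written for illustration, not the real dictionary.
word_dict = {
    "<padding>": 0,
    "<unk>": 1,
    "<s>": 2,
    "</s>": 3,
    "world": 4,
    "champion": 5,
}

# Reverse mapping: index -> word
reversed_dict = {index: word for word, index in word_dict.items()}

sentence = ["world", "champion", "kwan"]  # "kwan" is not in the toy dict

# word -> index, falling back to "<unk>" for unknown words
ids = [word_dict.get(w, word_dict["<unk>"]) for w in sentence]

# index -> word
words = [reversed_dict[i] for i in ids]

print(ids)    # [4, 5, 1]
print(words)  # ['world', 'champion', '<unk>']
```

Note that the round trip is lossy for unknown words: "kwan" comes back as "<unk>".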
To apply this we need some helper functions.
The goal of this first function is a simple cleaning of the data: it replaces every run of "#" and "." characters with a single #
def clean_str(sentence):
    sentence = re.sub("[#.]+", "#", sentence)
    return sentence
This substitution of characters is rather simple. You can of course add multiple substitution steps.
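As a rough idea of what multiple substitution steps could look like, here is a sketch. The function name clean_str_multi and the extra steps (lowercasing, collapsing whitespace) are my own assumptions for illustration; they are not part of the original code, and the dataset used here is already lowercased.

```python
import re


def clean_str_multi(sentence):
    """Hypothetical multi-step variant of clean_str (illustration only)."""
    # step 1 (assumed): lowercase everything
    sentence = sentence.lower()
    # step 2 (same as clean_str): collapse runs of '#' and '.' into one '#'
    sentence = re.sub(r"[#.]+", "#", sentence)
    # step 3 (assumed): collapse repeated whitespace into a single space
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence.strip()


cleaned = clean_str_multi("The  Score was 3...2  ## today")
print(cleaned)  # the score was 3#2 # today
```

Each step is a separate line on purpose, so it is easy to add, remove or reorder steps while experimenting.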
The next function reads a data file line by line and applies the clean_str function to each line. With toy=True it keeps only the first 50 lines, otherwise the first 200000.
def get_text_list(data_path, toy):
    with open(data_path, "r", encoding="utf-8") as f:
        if not toy:
            return [clean_str(x.strip()) for x in f.readlines()][:200000]
        else:
            return [clean_str(x.strip()) for x in f.readlines()][:50]
The next function, build_dict, is called in multiple cases (training and validation).
Here you will see that we add 4 built-in words that are essential for the seq2seq algorithm. They are <padding>, <unk>, <s> and </s>.
If the indentation of the code below looks wrong in your browser, copy the code from GitHub instead.
def build_dict(step, toy=False):
    if step == "train":
        # first let's load the training data
        train_article_list = get_text_list(train_article_path, toy)
        train_title_list = get_text_list(train_title_path, toy)

        # then let's collect all words from the training data
        # by simply tokenizing each text sample into its words
        # here we use the built-in function imported from the nltk toolkit
        # which simply returns a list of words from a sentence
        words = list()
        for sentence in train_article_list + train_title_list:
            for word in word_tokenize(sentence):
                words.append(word)

        # count the words and sort them from most to least common
        # (most_common() without an argument keeps every word)
        word_counter = collections.Counter(words).most_common()

        # first let's set the 4 built-in words
        word_dict = dict()
        word_dict["<padding>"] = 0
        word_dict["<unk>"] = 1
        word_dict["<s>"] = 2
        word_dict["</s>"] = 3

        # then let's build our dict by simply looping over word_counter
        for word, _ in word_counter:
            word_dict[word] = len(word_dict)

        # then let's save this to a pickle
        with open(default_path + "word_dict.pickle", "wb") as f:
            pickle.dump(word_dict, f)

    # all of the above was for the training step
    # in the validation step you can simply load the pickle
    # that you have just saved
    elif step == "valid":
        with open(default_path + "word_dict.pickle", "rb") as f:
            word_dict = pickle.load(f)

    # for both cases (training or validation)
    # we create a reversed dict
    reversed_dict = dict(zip(word_dict.values(), word_dict.keys()))

    # then for both cases (training or validation)
    # we define a max length for the article and for the summary
    article_max_len = 50
    summary_max_len = 15
    return word_dict, reversed_dict, article_max_len, summary_max_len
After building the dict for our data, we begin to build the actual dataset that will be used in our algorithm.
Using the above example of an article,
five-time world champion michelle kwan withdrew from the # us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the # turin olympics #
the algorithm would need this to be represented as a list of numbers,
which is simply the index from the word dict for each word in the given sentence.
The same would occur on the test data.
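Here is a small sketch of that conversion on a toy dictionary, including the cut-off and padding to a fixed length. The helper name encode, the toy words and max_len=6 are assumptions for illustration; the tutorial's own function uses word_tokenize and article_max_len = 50.

```python
def encode(tokens, word_dict, max_len):
    """Turn a list of tokens into a fixed-length list of word indexes."""
    # look up each token, unknown words map to "<unk>"
    ids = [word_dict.get(t, word_dict["<unk>"]) for t in tokens]
    # cut sentences that are too long
    ids = ids[:max_len]
    # pad sentences that are too short
    ids = ids + [word_dict["<padding>"]] * (max_len - len(ids))
    return ids


# Toy dictionary, hand-written for illustration
toy_dict = {
    "<padding>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
    "world": 4, "champion": 5, "withdrew": 6,
}

short = encode(["world", "champion", "kwan", "withdrew"], toy_dict, max_len=6)
long = encode(["world"] * 10, toy_dict, max_len=6)

print(short)  # [4, 5, 1, 6, 0, 0]
print(long)   # [4, 4, 4, 4, 4, 4]
```

Every encoded article ends up with exactly the same length, which is what lets us stack them into one batch later.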
def build_dataset(step, word_dict, article_max_len, summary_max_len, toy=False):
    # ---case of train
    # ---we load both (article, headline) for training
    if step == "train":
        article_list = get_text_list(train_article_path, toy)
        title_list = get_text_list(train_title_path, toy)
    # ---case of valid
    # ---we only load articles
    elif step == "valid":
        article_list = get_text_list(valid_article_path, toy)
    # ---if step is neither train nor valid, raise an error
    else:
        raise NotImplementedError

    # ---(for each article) get the list of words
    # ---so now x (article) contains lists of words
    x = [word_tokenize(d) for d in article_list]

    # ---(for each article) get the index of each word from word_dict
    # ---if not found, use the "<unk>" token
    # ---so now we have our train dataset
    x = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in x]

    # ---(for each article) limit x to article_max_len
    x = [d[:article_max_len] for d in x]

    # ---(for each article) if x is shorter than article_max_len
    # ---pad x using the "<padding>" token
    x = [d + (article_max_len - len(d)) * [word_dict["<padding>"]] for d in x]

    if step == "valid":
        return x
    else:
        # -------if step == "train"
        # -------we must do the same steps on the headline
        # -------but here we don't use the concept of padding
        y = [word_tokenize(d) for d in title_list]
        y = [[word_dict.get(w, word_dict["<unk>"]) for w in d] for d in y]
        y = [d[:(summary_max_len - 1)] for d in y]
        return x, y
So let's simply call both (build_dict and build_dataset)
print("Building dictionary...")
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)

print("Loading training dataset...")
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)
But we can't yet feed our neural network a list containing the indexes of words, as it would not understand them.
We need to represent the word itself in a format that our neural net can understand, and here comes the concept of word embeddings.
It is a simple concept that replaces each word in your dict with a list of numbers (in our case we model each word with a list of 300 floats).
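The lookup itself is very simple: the word index is used as a row number in a table of vectors. Below is a minimal sketch with made-up numbers and only 3 values per word instead of 300. I use plain Python lists here to keep it self-contained, whereas the tutorial's real table is a NumPy array.

```python
# Toy embedding table: row i holds the vector of the word with index i.
# All numbers are made up for illustration; the tutorial uses 300 values.
embedding_size = 3
embedding_matrix = [
    [0.0, 0.0, 0.0],    # 0: <padding>
    [0.0, 0.0, 0.0],    # 1: <unk>
    [0.5, -0.1, 0.3],   # 2: <s>
    [-0.2, 0.4, 0.1],   # 3: </s>
    [0.9, 0.1, 0.0],    # 4: world
    [0.8, 0.2, 0.1],    # 5: champion
]

# A sentence already converted to indexes (with one unknown word and padding)
sentence_ids = [4, 5, 1, 0]

# Replace every index with its vector
sentence_vectors = [embedding_matrix[i] for i in sentence_ids]

print(len(sentence_vectors), len(sentence_vectors[0]))  # 4 3
```

So a sentence of 4 indexes becomes a 4 x 3 grid of floats (4 x 300 in our real case), and that grid is what the neural net actually receives.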
There are pretrained models that have been trained on millions of texts to model words correctly. Once the words are modeled correctly, your neural net is able to capture much more of the meaning of the text within the article.
A well-known test of how well an algorithm understands text after using word embeddings is to apply word similarity to a given word.
The output of such a test tells us that the model is now capable of understanding the relations between words, which is an extremely important factor in the success of our neural net.
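To give a feeling for how such a similarity test works, here is a small sketch using cosine similarity, a common way to compare word vectors. The three vectors are made up by hand for illustration and do not come from GloVe; with real pretrained vectors you would run the same idea on 300-value lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Rank all other words by their similarity to the given word."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hand-made toy vectors (NOT real GloVe values)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

ranking = most_similar("king", toy_vectors)
print(ranking[0][0])  # queen
```

Words whose vectors point in nearly the same direction get a score close to 1, which is why "queen" ranks above "apple" for these toy vectors.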
There is a very well-known set of pretrained vectors called GloVe, provided by Stanford. You can download it from https://nlp.stanford.edu/projects/glove/
or you can simply copy it from my Google Drive as I explained before. Here is the link for the GloVe vectors in a pickle format.
def get_init_embedding(reversed_dict, embedding_size):
    print("Loading Glove vectors...")

    # ---load the glove model, which is in a pickle format
    with open(default_path + "glove/model_glove_300.pkl", 'rb') as handle:
        word_vectors = pickle.load(handle)

    # ---loop through all words within the reversed_dict
    used_words = 0
    word_vec_list = list()
    for _, word in sorted(reversed_dict.items()):
        try:
            # -----------if the word is found in the glove model,
            # -----------save its vector
            word_vec = word_vectors.word_vec(word)
            used_words += 1
        except KeyError:
            # -----------else, generate an array of zeros
            # -----------of length = embedding_size
            # -----------which in this case would be 300
            # -----------this is also the case for <padding> and <unk>
            # -----------(<s> and </s> are replaced by random vectors below)
            word_vec = np.zeros([embedding_size], dtype=np.float32)

        # -------add it to the array
        # -------remember that we are looping over the sorted reversed_dict
        # -------so the index of the element inside word_vec_list
        # -------is the same as the index of the word
        # -------no need for a dict, an array is sufficient
        word_vec_list.append(word_vec)

    # ---just print out the percentage of known words
    print("words found in glove percentage = " + str((used_words / len(word_vec_list)) * 100))

    # ----assign a random vector to the <s> and </s> tokens
    word_vec_list[2] = np.random.normal(0, 1, embedding_size)
    word_vec_list[3] = np.random.normal(0, 1, embedding_size)

    # ----then return the array
    return np.array(word_vec_list)
To use the function we simply call
word_embedding = get_init_embedding(reversed_dict, 300)
So we can say that we have now correctly represented the text for our task of text summarization.
To sum it all up, we have built the code to build the word dict, build the dataset and load the word embeddings,
by simply calling
word_dict, reversed_dict, article_max_len, summary_max_len = build_dict("train", False)
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, False)
word_embedding = get_init_embedding(reversed_dict, 300)
In the coming tutorial, if GOD wills it, we will go through how to build the model itself. We will build a seq2seq encoder-decoder model using LSTM, and we will go through the details of building such a model using TensorFlow. This will be the cornerstone for the next tutorials in the series, which will go through the latest approaches to this problem.
Don't forget to clone the code for this tutorial from my repo,
and you can take a look at the previous tutorial, which gives an overview of text summarization.
You can also check this blog about the ecosystem of a free deep learning platform.
I truly hope you have enjoyed this tutorial. I am waiting for your feedback, and I am waiting for you in the next tutorial, if GOD wills it.