This tutorial is the sixth in a series that helps you build an abstractive text summarizer using TensorFlow. Today we would build an abstractive text summarizer in TensorFlow in an optimized way.
Today we would go through one of the most optimized models built for this task. The model was written by dongjun-Lee (this is the link to his model). I have used his model on different datasets (in different languages) and it produced truly amazing results, so I would truly like to thank him for his effort.
I have made multiple modifications to the model to enable it to run seamlessly on Google Colab (link to my model), and I have hosted the data on Google Drive (more on how to link Google Drive to Google Colab). So there is no need to download the code or the data; you only need a Google Colab session, a copy of the data from my Google Drive to yours (more on this), and a connection from Google Drive to your Colab notebook.
This is a series of tutorials that would help you build an abstractive text summarizer using TensorFlow, using multiple approaches. We call it abstractive because we teach the neural network to generate words, not merely copy them.
We have covered so far (code for this series can be found here)
0. Overview of the free ecosystem for deep learning (how to use Google Colab with Google Drive)
The data we would use consists of news articles and their headlines. It can be found on my Google Drive, so you can just copy it to your Google Drive without the need to download it (more on this).
We would represent the data using word embeddings, which simply means converting each word to a specific vector; we would also create a dictionary for our words (more on this) (prerequisites for this tutorial).
There are different approaches for this task; they are all built on a cornerstone concept, and they keep developing and building on each other.
Today we would start building this cornerstone implementation, which is a type of network called an RNN arranged in an Encoder/Decoder architecture called seq2seq (more on this). We would build the seq2seq in a multilayer bidirectional structure, where the RNN cell is an LSTM cell (more on this); then we would add an attention mechanism to better interface the encoder with the decoder (more on this); and finally, to generate better output, we would use the ingenious concept of beam search (more on this).
The code for all these different approaches can be found here.
So let's get started!
Our model can be seen as structured into different blocks; these blocks are:
1. Model parameters: here we would initialize the needed TensorFlow placeholders & variables, and define the RNN cell that would be used throughout the model.
2. Word embedding: here we would define the embedding matrix used in both the encoder & the decoder.
3. Encoder: here we would define the multilayer bidirectional RNN (more on this) that forms the encoder part of our model, and output the encoder state as an input to the decoder part.
4. Decoder: here the decoder is actually partitioned into 2 distinct parts, a training decoder and an inference (beam search) decoder.
5. Loss & optimization: this block would only be used in the training phase; here we would apply clipping to our gradients and actually run our optimizer (the Adam optimizer is used here), and this is where we would apply our gradients to the optimizer.
First we would need to import the libraries that we would use:
import tensorflow as tf
from tensorflow.contrib import rnn #cell that we would use
Before building our Model class, we need to define some TensorFlow concepts first.
So we tend to define placeholders like this
X = tf.placeholder(tf.int32, [None, article_max_len])
# here we define the input X as int32, with a promise to provide its
# data at runtime
#
# we also provide its shape, where None is used for a dimension of
# any size
And for variables, we tend to define them as:
global_step = tf.Variable(0, trainable=False)
# a variable must be initialized,
# and we can set it to either be trainable or not
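To make these two concepts concrete, here is a minimal standalone sketch (not part of the model code; article_max_len, dummy_batch and first_tokens are just illustrative names) showing that variables need explicit initialization while placeholders only receive their data at run time through feed_dict:

import numpy as np
import tensorflow as tf

article_max_len = 50
X = tf.placeholder(tf.int32, [None, article_max_len])
global_step = tf.Variable(0, trainable=False)
first_tokens = X[:, 0]  # a toy op that depends on the placeholder

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # variables must be initialized before use
    dummy_batch = np.zeros((4, article_max_len), dtype=np.int32)  # 4 fake articles of word ids
    tokens, step = sess.run([first_tokens, global_step], feed_dict={X: dummy_batch})
    # tokens is an array of 4 zeros, step is 0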
Then let's build our Model class:
class Model(object):
    def __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only=False):
        self.vocabulary_size = len(reversed_dict)
        self.embedding_size = args.embedding_size
        self.num_hidden = args.num_hidden
        self.num_layers = args.num_layers
        self.learning_rate = args.learning_rate
        self.beam_width = args.beam_width
We pass an object called args that contains multiple hyperparameters of the model (such as embedding_size, num_hidden, num_layers, learning_rate and beam_width). We would also need to initialize the model with other parameters, like the dropout keep probability. So, to continue the initialization:
if not forward_only: #training phase
    #keep_prob as variable in training phase
    self.keep_prob = args.keep_prob
else: #testing phase
    #keep_prob constant in testing phase
    self.keep_prob = 1.0
#here we would use LSTM as our cell
self.cell = tf.nn.rnn_cell.BasicLSTMCell
#projection layer that would be used in decoder in both
#training and testing phase
with tf.variable_scope("decoder/projection"):
    self.projection_layer = tf.layers.Dense(self.vocabulary_size, use_bias=False)
#define the batch size (our data would be provided in batches)
self.batch_size = tf.placeholder(tf.int32, (), name="batch_size")
#X as input, its second dimension is the maximum article length
self.X = tf.placeholder(tf.int32, [None, article_max_len])
self.X = tf.placeholder(tf.int32, [None, article_max_len])
self.X_len = tf.placeholder(tf.int32, [None])
#define decoder (input , target , length)
#using the summary length
self.decoder_input = tf.placeholder(tf.int32, [None, summary_max_len])
self.decoder_len = tf.placeholder(tf.int32, [None])
self.decoder_target = tf.placeholder(tf.int32, [None, summary_max_len])
#define global step beginning from zero
self.global_step = tf.Variable(0, trainable=False)
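Before moving on, here is a hedged sketch of how this class would typically be instantiated; the exact calls live in the training/testing scripts, so treat the lines below as an illustration of the forward_only flag rather than the original code:

#assumed usage, for illustration only
model = Model(reversed_dict, article_max_len, summary_max_len, args)  # training phase: keep_prob taken from args
#model = Model(reversed_dict, article_max_len, summary_max_len, args, forward_only=True)  # testing phase: keep_prob fixed to 1.0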
Here we would represent both our input articles (the embedded encoder inputs) and the decoder inputs using word2vec embeddings (more on this).
We would define our variables for embedding inside a name scope; we would name it "embedding".
with tf.name_scope("embedding"):
    #if training,
    #and the args.glove flag is enabled
    if not forward_only and args.glove:
        #here we use tf.constant as we won't change it
        #get_init_embedding is a function
        #that returns the vector for each word in our dict
        init_embeddings = tf.constant(get_init_embedding(reversed_dict, self.embedding_size), dtype=tf.float32)
    else: #else randomly initialize the word vectors
        init_embeddings = tf.random_uniform([self.vocabulary_size, self.embedding_size], -1.0, 1.0)
    self.embeddings = tf.get_variable("embeddings", initializer=init_embeddings)
    #then look up the embeddings for the encoder input,
    #transposing to the time-major layout [time, batch, embedding_size]
    self.encoder_emb_inp = tf.transpose(tf.nn.embedding_lookup(self.embeddings, self.X), perm=[1, 0, 2])
    #and do the same for the decoder input
    self.decoder_emb_inp = tf.transpose(tf.nn.embedding_lookup(self.embeddings, self.decoder_input), perm=[1, 0, 2])
Here we would actually define the multilayer bidirectional LSTM that forms the encoder part of our seq2seq (more on this). We would define our variables here in a name scope that we would call "encoder".
Here we would also use the concept of Dropout: we wrap each cell in our architecture with it. Dropout randomly drops (zeroes out) a subset of the units during training, and is used as a regularization technique.
with tf.name_scope("encoder"):
    fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
    bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
    fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]
    bw_cells = [rnn.DropoutWrapper(cell) for cell in bw_cells]
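One thing worth noting: as written above, DropoutWrapper is left at its default keep probability of 1.0, so the self.keep_prob we defined earlier is not actually wired in at this point. If you wanted the dropout rate to be configurable, a hedged variant (my suggestion, not the original code) would pass it explicitly:

    #suggested variant (assumption, not the original code)
    fw_cells = [rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) for cell in fw_cells]
    bw_cells = [rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) for cell in bw_cells]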
Now, after defining the forward and backward cells, we would need to actually connect them together to form the bidirectional structure. For this we would use stack_bidirectional_dynamic_rnn, which takes all of the following parameters as its inputs:
Forward cells
Backward cells
Encoder embedded input, encoder_emb_inp (the input articles in word2vec format)
X_len (length of articles)
Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation.
encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    fw_cells, bw_cells, self.encoder_emb_inp,
    sequence_length=self.X_len, time_major=True, dtype=tf.float32)
Now we would need to actually use the outputs of stack_bidirectional_dynamic_rnn. We mainly need two of them: the encoder output (which feeds the attention mechanism) and the encoder state (which feeds the decoder as its initial state).
So to get encoder_output we simply concatenate the outputs along the feature dimension:
self.encoder_output = tf.concat(encoder_outputs, 2)
Then to get encoder_state, we would concatenate the cell states (c) and the hidden states (h) of both the forward & backward directions, and wrap them in an LSTMStateTuple:
encoder_state_c = tf.concat((encoder_state_fw[0].c, encoder_state_bw[0].c), 1)
encoder_state_h = tf.concat((encoder_state_fw[0].h, encoder_state_bw[0].h), 1)
self.encoder_state = rnn.LSTMStateTuple(c=encoder_state_c, h=encoder_state_h)
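To keep track of the dimensions flowing out of the encoder, here is a short orientation note; these shapes are my reading of the calls above, stated as an assumption rather than taken from the original article:

# assumed shapes, for orientation only
# self.encoder_output : [article_max_len, batch_size, 2 * num_hidden]  (time-major, forward & backward concatenated)
# encoder_state_c     : [batch_size, 2 * num_hidden]
# encoder_state_h     : [batch_size, 2 * num_hidden]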
Here the decoder is divided into 2 parts: a training decoder and an inference decoder that uses beam search.
So let's first define our name scope & variable scope for both parts. We would also define a decoder cell, an LSTM with twice the hidden size (to match the concatenated state of the bidirectional encoder), that would be used for both parts.
with tf.name_scope("decoder"), tf.variable_scope("decoder") as decoder_scope:
    decoder_cell = self.cell(self.num_hidden * 2)
First we need to prepare our attention structure , here we would use BahdanauAttention
encoder_output would be used inside the attention calculation (more on attention model)
attention_states = tf.transpose(self.encoder_output, [1, 0, 2])
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
self.num_hidden * 2, attention_states, memory_sequence_length=self.X_len, normalize=True)
Then we would further define the decoder cell (in the first step of the decoder we just defined it as a plain LSTM cell; now we would add attention). To do this we use AttentionWrapper, which combines the attention_mechanism with the decoder cell:
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(decoder_cell, attention_mechanism, attention_layer_size=self.num_hidden * 2)
Now we would need to define the inputs to the decoder cell. This input actually comes from 2 sources (more on seq2seq): the initial state coming from the encoder, and the decoder input itself (the embedded summary, provided during training).
So let's first define the initial state that comes from the encoder:
initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=self.batch_size)
initial_state = initial_state.clone(cell_state=self.encoder_state)
Now we would combine the initial state with the decoder input (the summary sentence). To use BasicDecoder, we need to provide the decoder input through a helper; this helper combines decoder_emb_inp and decoder_len together:
helper = tf.contrib.seq2seq.TrainingHelper(self.decoder_emb_inp, self.decoder_len, time_major=True)
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, initial_state)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, output_time_major=True, scope=decoder_scope)
Now, for the last step of the training phase, we would need to define the outputs (logits) from the decoder, to be used within the loss block for training.
#just take the rnn_output from all the decoder outputs
self.decoder_output = outputs.rnn_output
#then get the logits, by applying the projection layer and transposing back to batch-major
self.logits = tf.transpose(self.projection_layer(self.decoder_output), perm=[1, 0, 2])
#then pad the logits with zeros along the time axis up to summary_max_len, so they line up with decoder_target in the loss
self.logits_reshape = tf.concat(
[self.logits, tf.zeros([self.batch_size, summary_max_len - tf.shape(self.logits)[1], self.vocabulary_size])], axis=1)
Here in this phase, there are 2 main goals: tiling the encoder outputs and states so beam search can keep beam_width hypotheses per example, and running a beam search decoder to produce the actual predictions.
First, let's tile the encoder output, encoder state & X_len (article length) beam_width times, so that each beam hypothesis gets its own copy; this is what enables the beam search methodology. Here we use the beam_width variable that was already defined above.
tiled_encoder_output = tf.contrib.seq2seq.tile_batch(
tf.transpose(self.encoder_output, perm=[1, 0, 2]), multiplier=self.beam_width)
tiled_encoder_final_state = tf.contrib.seq2seq.tile_batch(self.encoder_state, multiplier=self.beam_width)
tiled_seq_len = tf.contrib.seq2seq.tile_batch(self.X_len, multiplier=self.beam_width)
Then let's define the attention mechanism (just like before, but using the tiled variables):
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
self.num_hidden * 2, tiled_encoder_output, memory_sequence_length=tiled_seq_len, normalize=True)
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(decoder_cell, attention_mechanism, attention_layer_size=self.num_hidden * 2)
initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=self.batch_size * self.beam_width)
initial_state = initial_state.clone(cell_state=tiled_encoder_final_state)
Then let's define our decoder, but here we would use the BeamSearchDecoder, which takes all of the following as inputs:
Decoder cell (previously defined)
Embedding matrix, embeddings (defined in the embedding part)
Projection layer (defined at the beginning of the class)
Decoder initial state (previously defined)
beam_width (user defined)
Start token & end token (the constants 2 and 3 below, i.e. the dictionary indices of the start and end tokens)
decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    embedding=self.embeddings,
    start_tokens=tf.fill([self.batch_size], tf.constant(2)),
    end_token=tf.constant(3),
    initial_state=initial_state,
    beam_width=self.beam_width,
    output_layer=self.projection_layer)
Then all that is left to do is to define the outputs, which directly correspond to the real output of the whole seq2seq architecture, as this phase is where the prediction is actually computed.
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
decoder, output_time_major=True, maximum_iterations=summary_max_len, scope=decoder_scope)
self.prediction = tf.transpose(outputs.predicted_ids, perm=[1, 2, 0])
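As a hedged illustration of how these predictions would be turned back into words at test time (sess, batch_x and batch_x_len are assumptions about the surrounding test loop, and reversed_dict is assumed to map ids back to words; this is not part of the Model class):

#assumed test-time usage, for illustration only
prediction = sess.run(model.prediction, feed_dict={
    model.batch_size: len(batch_x),
    model.X: batch_x,
    model.X_len: batch_x_len})
#prediction has shape [batch_size, beam_width, time]; take the best beam (index 0)
for beams in prediction:
    words = [reversed_dict[int(idx)] for idx in beams[0] if int(idx) != 3]  # drop end tokens (3)
    print(" ".join(words))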
This block is where training actually occurs, and it happens through multiple steps.
First we define our name scope, and specify that this block would only run during the training phase:
with tf.name_scope("loss"):
    if not forward_only:
Second we would calculate the loss (more on loss calculation)
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=self.logits_reshape, labels=self.decoder_target)
#mask out the loss at padded time steps beyond each summary's real length
weights = tf.sequence_mask(self.decoder_len, summary_max_len, dtype=tf.float32)
self.loss = tf.reduce_sum(crossent * weights / tf.to_float(self.batch_size))
Third we would calculate our gradients, and apply clipping to them to solve the problem of exploding gradients (more on exploding gradients).
Exploding Gradients: occurs with deep networks (i.e., networks with many layers, like in our case); when we apply backpropagation, the gradients can get too large. This problem can actually be solved rather easily using the concept of gradient clipping, which is simply setting a specific threshold such that when the gradients exceed it, we clip them back to a certain value.
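To make the clipping rule concrete, here is a minimal plain-Python sketch of what clipping by global norm does (an illustration of the formula, not part of the model): the global norm is the norm of all gradients stacked together, and if it exceeds the threshold, every gradient is rescaled by threshold / global_norm. tf.clip_by_global_norm, used in the code below, applies exactly this rule.

import math

def clip_by_global_norm_sketch(gradients, clip_norm=5.0):
    # global norm = sqrt(sum of squared entries over all gradient tensors)
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm > clip_norm:
        scale = clip_norm / global_norm
        gradients = [[g * scale for g in grad] for grad in gradients]
    return gradients, global_norm

# example: two "gradients" whose global norm is 10 get rescaled so their global norm becomes 5
clipped, norm = clip_by_global_norm_sketch([[6.0, 0.0], [0.0, 8.0]], clip_norm=5.0)
# norm == 10.0, clipped == [[3.0, 0.0], [0.0, 4.0]]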
params = tf.trainable_variables()
gradients = tf.gradients(self.loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
Fourth, we would apply our optimizer; here we would use the Adam optimizer with the previously defined learning_rate:
optimizer = tf.train.AdamOptimizer(self.learning_rate)
self.update = optimizer.apply_gradients(zip(clipped_gradients, params), global_step=self.global_step)
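Finally, as a hedged sketch of how these ops would be driven from a training loop (the batches generator and feed values are assumptions about the surrounding train script, not part of the class):

#assumed training loop, for illustration only
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_x, batch_x_len, batch_dec_in, batch_dec_len, batch_target in batches:  # assumed batch generator
        feed = {
            model.batch_size: len(batch_x),
            model.X: batch_x,
            model.X_len: batch_x_len,
            model.decoder_input: batch_dec_in,
            model.decoder_len: batch_dec_len,
            model.decoder_target: batch_target,
        }
        _, step, loss = sess.run([model.update, model.global_step, model.loss], feed_dict=feed)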
Next time, if GOD wills it, we would go through
Then, after we are done with this core model implementation, if GOD wills it, we would go through other modern implementations for text summarization, like
(more on different implementations for seq2seq for text summarization)
All the code for this tutorial is available as open source here.
I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials can be found here; you can simply use Google Colab to run it. Please review the tutorial and tell me what you think about it. Hope to see you again!