amr zaki


Teach seq2seq models to learn from their mistakes using deep curriculum learning (Tutorial 8)

scheduled sampling to help seq2seq model learn from its mistakes

This tutorial is the eighth in a series that helps you build an abstractive text summarizer using TensorFlow.

Today we will use curriculum learning to address a major problem that seq2seq models suffer from.

Seq2seq models are trained by maximizing the likelihood of the next token given BOTH

  1. the output of the previous decoder step (the previous LSTM step)
  2. the ground truth summary

During inference (testing), however, the model can only depend on

  1. the output of the previous decoder step

because no ground truth summary is available at test time.

The seq2seq model has been trained to depend on outside help.
At test time it is forced to depend only on itself, which is something it hasn’t been raised to do!

This causes a major problem: a discrepancy between training and inference (testing). It is called the exposure bias problem.

There have been multiple approaches to solving this problem. One of them is to make the model start depending on itself during training, by exposing it to its own mistakes so that it learns to correct them (i.e. it learns from its mistakes during the training phase). This is called ‘scheduled sampling’, a form of curriculum learning that we will use to help our seq2seq models.

This model has been implemented using TensorFlow (code can be found here) in a Jupyter notebook that runs on Google Colab and connects seamlessly with Google Drive, so there is no need to run code on your machine or download data; everything can be done on Google Colab for free (more on this).

This tutorial builds on the concepts introduced by Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer from Google in their paper (Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks).
The code is from yasterk; I have modified it to run on Google Colab (my code).

0. About Series

This is a series of tutorials that helps you build an abstractive text summarizer using TensorFlow, following multiple approaches. We call it abstractive because we teach the neural network to generate words, not merely to copy them.

We have covered so far (code for this series can be found here)

  0. Overview on the free ecosystem for deep learning (how to use google colab with google drive)
  1. Overview of the text summarization task and the different techniques for the task
  2. Data used and how it could be represented for our task (prerequisites for this tutorial)
  3. What is seq2seq for text summarization and why
  4. Multilayer Bidirectional LSTM/GRU
  5. Beam Search & Attention for text summarization
  6. Building a seq2seq model with attention & beam search
  7. Combination of Abstractive & Extractive methods for Text Summarization

EazyMind free Ai-As-a-service for text summarization

You can try generating your own summaries using the output of this series through eazymind, and see what you will eventually be able to build yourself. You can also call it through simple API calls or through a Python package, so that text summarization can be easily integrated into your application without the hassle of setting up the TensorFlow environment. You can register for free and use this API for free.

Let's begin!

1. Exposure bias problem

The model has never been raised to depend on itself.

Seq2seq models are trained to depend on:

  1. the output from the previous node of the decoder (i.e. the output of the previous state)
  2. the input summary (the ground truth)

The problem arises in the inference (testing) step, where the model is not given the input summary. It only depends on:

  1. the output from the previous node (the previous LSTM decoder step)

This causes a discrepancy between how the model is trained and how it runs in inference (testing). This problem is called exposure bias.

2. How would the Exposure bias problem affect our model?

In the inference (testing) phase, as we have just said, the model only depends on the previous step, which means that it totally depends on itself.

The problem arises when the model produces a bad output at step (t-1), i.e. the previous time step yields a wrong token. This affects all the following steps: it leads the model into a region of the state space entirely different from what it saw during training, so it simply won’t know what to do. The result is a cascade of bad output decisions.
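To make the discrepancy and the resulting cascade concrete, here is a minimal toy sketch. The lookup table `TOY_MODEL`, the helper names, and the example sentence are all made up for illustration; a real decoder would be an LSTM that also carries a hidden state, not a dictionary.

```python
from typing import Dict, List

# Hypothetical "decoder": maps the previous token to the next predicted token.
# It contains one mistake: after "the" it predicts "dog" instead of "cat".
TOY_MODEL: Dict[str, str] = {
    "<s>": "the", "the": "dog", "cat": "sat", "sat": "down",
    "dog": "ran", "ran": "off",
}


def decode_with_ground_truth(target: List[str]) -> List[str]:
    """Training-style decoding: every step is fed the ground-truth previous token."""
    inputs = ["<s>"] + target[:-1]
    return [TOY_MODEL[token] for token in inputs]


def decode_on_its_own(length: int) -> List[str]:
    """Inference-style decoding: every step is fed the model's own previous output."""
    outputs: List[str] = []
    previous = "<s>"
    for _ in range(length):
        previous = TOY_MODEL[previous]
        outputs.append(previous)
    return outputs


target = ["the", "cat", "sat", "down"]
train_out = decode_with_ground_truth(target)
infer_out = decode_on_its_own(len(target))
print(train_out)  # one wrong token, then the ground truth puts it back on track
print(infer_out)  # the same single mistake derails every later step
```

With the ground truth fed in, the one wrong prediction stays isolated. When the toy model runs on its own, that same mistake becomes the next input and every following token goes wrong.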

3. Let's solve it by curriculum learning

A solution to this problem, suggested by Bengio et al. from Google Research, is to gradually shift the model’s reliance from being totally dependent on the supplied ground truth to depending on itself (i.e. depending only on the tokens it generated at previous time steps in the decoder).

The concept of making the learning path gradually more difficult over time (here, making the model depend only on itself) is called curriculum learning.

Their technique to implement this was truly genius. They call it ‘scheduled sampling’.

They built a simple sampling mechanism that randomly chooses (during training) where to take the decoder input from. Either:

  1. the ground truth (with probability e_i, where i is the index of the training batch)
  2. the model itself (with probability (1 - e_i))

So let’s flip a coin.

If it’s heads (with probability e_i) → we use the ground truth summary.

If it’s tails (with probability (1 - e_i)) → we use the output from the previous time step.

[Image: coin animation, borrowed from Google search results]
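The coin flip can be sketched in a few lines of plain Python. This is only an illustration of the sampling rule, not the code used in the notebook; the function name `choose_decoder_input` and the placeholder tokens are assumptions made for this example.

```python
import random


def choose_decoder_input(ground_truth_token: str, model_token: str,
                         e_i: float, rng: random.Random) -> str:
    """Flip a biased coin: heads (probability e_i) picks the ground truth,
    tails (probability 1 - e_i) picks the model's own previous output."""
    if rng.random() < e_i:
        return ground_truth_token
    return model_token


rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks = [choose_decoder_input("truth", "model", 0.75, rng) for _ in range(10_000)]
truth_fraction = picks.count("truth") / len(picks)
print(truth_fraction)  # should be close to 0.75
```

In a real decoder this choice would be made at every time step of every training sequence, so a single summary can mix ground-truth inputs and model-generated inputs.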

Intuitively, we can do even better. Instead of a constant e, we can make it variable: at the beginning of training we favor the ground truth summaries, while towards the end of training we favor the output of the model itself, since by then the model has learnt more. So let’s schedule the decay of e (the probability).

[Figure borrowed from Bengio et al., Google Research]

The decay of e can itself be a function of the number of iterations.

This is where the name ‘scheduled sampling’ comes from.
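Bengio et al. describe three families of decay schedules: linear, exponential, and inverse sigmoid. The following is a minimal sketch of them. The function names and the default constants are illustrative assumptions, not values taken from the paper or from the code used in this tutorial.

```python
import math


def linear_decay(i: int, k: float = 1.0, c: float = 1e-4, e_min: float = 0.0) -> float:
    """e_i = max(e_min, k - c * i): drops by c per batch until it reaches e_min."""
    return max(e_min, k - c * i)


def exponential_decay(i: int, k: float = 0.9999) -> float:
    """e_i = k ** i, with k < 1."""
    return k ** i


def inverse_sigmoid_decay(i: int, k: float = 1000.0) -> float:
    """e_i = k / (k + exp(i / k)), with k >= 1."""
    x = i / k
    if x > 700.0:  # exp() would overflow a float; e_i is effectively 0 here
        return 0.0
    return k / (k + math.exp(x))


# Print e_i for a few batch indices to compare the three shapes.
for step in (0, 5_000, 10_000, 20_000):
    print(step,
          round(linear_decay(step), 3),
          round(exponential_decay(step), 3),
          round(inverse_sigmoid_decay(step), 3))
```

All three start near 1 (almost always use the ground truth) and fall towards 0 (almost always use the model’s own output). They differ only in how quickly the model is weaned off the ground truth.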

4. Implement scheduled sampling in TensorFlow

Yasterk built a great library in TensorFlow that lets you implement multiple papers on text summarization. One of them is (Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks). I have modified it to run on Google Colab (my code).

The library can be adjusted to implement different papers just by modifying the flags. Here (in my Jupyter notebook) I have set the required flags and also enabled a variant of the decoder called intradecoder (to limit word repetition), so you only need to run the example with the flags as set.

We work on the CNN / Daily Mail news dataset, which is widely used for this task. You can copy the dataset directly from my Google Drive to your own Google Drive (without the need to download and then upload) and connect it seamlessly to your Google Colab (more about this).

Next time, we will look at combining reinforcement learning with deep learning to tackle the exposure bias problem and other problems that seq2seq models suffer from.

I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials can be found here. You can simply use Google Colab to run it. Please review the tutorial and the code and tell me what you think about them. Don’t forget to try out eazymind for free text summarization generation. Hope to see you again.
