This tutorial is the eighth in a series of tutorials that will help you build an abstractive text summarizer using tensorflow. Today we will use curriculum learning to solve a major problem that seq2seq models suffer from.

seq2seq models are trained by maximizing the likelihood of the next token given BOTH the previous token (from the previous LSTM step) AND the ground truth summary, while in inference (testing) the model can only depend on the previous token, since no ground truth summary can be provided. In other words, the seq2seq model has been trained to depend on outside help, yet in testing it is forced to depend only on itself, which is something it hasn't been raised to do! This causes a major problem: the discrepancy between training and inference (testing). This is called the Exposure Problem.

There have been multiple approaches to solve this problem. One of them is, while in training, to make the model begin learning to depend on itself by exposing it to its own mistakes so that it tries to correct them (i.e. learn from its mistakes during the training phase). This is what is called 'Scheduled Sampling', a form of curriculum learning that we will use to help our seq2seq models.

This model has been implemented using tensorflow in a jupyter notebook (code can be found here) to run on google colab and connect seamlessly with google drive, so there is no need to either run the code on your machine or download the data, as everything can be done on google colab for free (more on this).

This tutorial is built over the concepts addressed by Bengio, Vinyals, Jaitly, and Shazeer from google in their paper (Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks). The code is from yasterk; I have modified it to run on google colab (my code).
0. About This Series

This is a series of tutorials that will help you build an abstractive text summarizer using tensorflow, in multiple approaches. We call it abstractive because we teach the neural network to generate words, not to merely copy words.

We have covered so far (code for this series can be found here):

0. Overview on the free ecosystem for deep learning (how to use google colab with google drive)
1. Overview of the text summarization task and the different techniques for the task
2. Data used and how it could be represented for our task (prerequisites for this tutorial)
3. What is seq2seq for text summarization, and why
4. Multilayer Bidirectional LSTM/GRU
5. Beam Search & Attention for text summarization
6. Building a seq2seq model with attention & beam search
7. Combination of Abstractive & Extractive methods for Text Summarization
8. EazyMind free Ai-As-a-service for text summarization

You can actually try generating your own summaries using the output of this series through eazymind, and see what you would eventually be able to build yourself. You can also call it through simple API calls and through a python package, so that text summarization can be easily integrated into your application without the hassle of setting up the tensorflow environment. You can register for free, and enjoy using this API for free.

Let's begin!

1. Exposure bias problem

seq2seq models are trained to depend on both:

- the output from the previous node of the decoder (the output of the previous state), and
- the input (ground truth) summary.

The model has never been raised to depend on itself. The problem arises in the inference (testing) step, where the model is not provided the input summary. There, it only depends on:

- the output from the previous node (the previous lstm decoder step).

This causes a discrepancy between how the model is trained and how it runs in inference (testing). This problem is called Exposure bias.

2. How would the Exposure bias problem affect our model?
In the inference (testing) phase, as we have just said, the model depends only on the previous step, which means it depends entirely on itself. The problem arises when the model produces a bad output at time step (t-1), i.e. the previous time step results in a bad output. This affects the whole of the following sequence: it leads the model into an entirely different state space from the one it has seen and trained on in the training phase, so it simply won't know what to do. The result is an accumulation of bad output decisions.

3. Let's solve it by curriculum learning

A solution to this problem, suggested by Bengio et al. from google research, was to gradually change the reliance of the model from being totally dependent on the ground truth being supplied to it, to depending on itself (i.e. depending only on its own tokens generated in the previous time steps of the decoder).

The concept of making the learning path more difficult over time (i.e. gradually making the model depend only on itself) is called curriculum learning. Their technique for implementing this was truly genius; they call it 'scheduled sampling'. They built a simple sampling mechanism which randomly chooses, during training, where to sample the next decoder input from. Either:

- the ground truth, with probability ei (i stands for the batch number), or
- the model itself, with probability (1 - ei).

So let's flip a coin. If it's heads (with probability ei) → we use the ground truth summary. If it's tails (with probability 1 - ei) → we use the output from the previous time step.

coin animation borrowed from google search results

Intuitively we can do even better: instead of keeping the probability e constant, we can make it variable. At the beginning of training we can favor using the ground truth summaries, while towards the end of training we can favor using the output from the model itself, as by then the model will have learned much more. So let's schedule the decay of e.
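The coin flip and the decay of e can be sketched in plain Python/NumPy. This is a minimal illustration, not the actual library code: the function names and the constants are my own choices, while the three decay forms (linear, exponential, and inverse sigmoid) are the ones proposed in the scheduled sampling paper.

```python
import math
import numpy as np

def scheduled_sample(ground_truth, model_preds, e_i, rng):
    """Coin flip per sequence in the batch: with probability e_i feed the
    ground-truth token, otherwise feed the model's own previous prediction.

    ground_truth, model_preds: int token arrays of shape (batch_size,)
    e_i: probability of using the ground truth for this batch
    """
    heads = rng.random(ground_truth.shape) < e_i  # heads -> ground truth
    return np.where(heads, ground_truth, model_preds)

# Three ways to decay e as a function of the training iteration i
# (constant values here are illustrative, not the paper's exact settings):

def linear_decay(i, k=1.0, c=1e-4, floor=0.1):
    # e_i = max(floor, k - c*i): linear drop to a minimum ground-truth rate
    return max(floor, k - c * i)

def exponential_decay(i, k=0.9999):
    # e_i = k**i, with k < 1
    return k ** i

def inverse_sigmoid_decay(i, k=1000.0):
    # e_i = k / (k + exp(i/k)): stays near 1 early, then falls off smoothly
    return k / (k + math.exp(i / k))
```

Note the two extremes: at e_i = 1 this reduces to ordinary teacher forcing (always feed the ground truth), and at e_i = 0 the decoder runs entirely on its own outputs, exactly matching the inference setting.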
Decay curves borrowed from Bengio et al. from google research

The decay of e itself can be a function of the number of training iterations; from here comes the name scheduled sampling.

4. Implement scheduled sampling in Tensorflow

Yasterk built a great library in tensorflow that lets you implement multiple papers concerning text summarization; one of them is (Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks). I have modified it to run on google colab (my code).

The library can be adjusted to implement different papers by just modifying its flags. In my code (a jupyter notebook) I have set the required flags, and also enabled a version of the decoder called the intradecoder (to limit word repetition), so you can simply run the example with the flags already set.

We work on the news data of CNN / Daily Mail. It is a widely used dataset for this task. You can copy the dataset directly from my google drive to your own google drive (without the need to download and then upload it), and connect it seamlessly to your google colab (more about this).

Next time, we will go through the combination of reinforcement learning with deep learning, both to solve the exposure problem and to address other problems that seq2seq suffers from. I truly hope you have enjoyed reading this tutorial, and I hope I have made these concepts clear. All the code for this series of tutorials can be found here. You can simply use google colab to run it; please review the tutorial and the code and tell me what you think about it. Don't forget to try out eazymind for free text summarization generation. Hope to see you again!