Nikhil B

@nik_b

Text Generation for Char LSTM models

November 6th 2018

Train a character-level language model on a corpus of jokes.

I decided to experiment with approaches to this problem, which I found on OpenAI’s Request for Research blog. You can have a look at the code here. This is written in Pytorch, and is heavily inspired by Fast.ai’s fantastic lesson on implementing RNN’s from scratch.

Data preparation I started off using the dataset provided by OpenAI. The data was converted to lowercase and for an initial run, I selected the top rated jokes, with a word length of less than 200. Here’s an example of all the tokens encountered:

*Explicit words ahead!* This particular dataset has explicit words/content, so those come up in the output predictions of the model. Another interesting problem to work on would be to filter out inappropriate words from the output of the model — either by checking the generated output for appearance of a known set of explicit words, or better still — by building a clean/not clean sentiment classifier! I also ran the results using a much cleaner dataset I found on github.

I tried several models for LSTM’s primarily varying the batch size, hidden layers. Training the char level model got progressively more difficult as the number of back-propagation char tokens (bptt) increased. Lets focus mainly on the Text Generation part for now. The model outputs one character at a time and our goal is to generate a bunch of (seemingly intelligible and possibly funny) text. There are many approaches that could be taken:

Greedy approach:

This is the simplest method in which we recursively select the most probable next sequence character. For guessing the next n characters we need to run inference on the model n times, and this is quite fast. The output in this case has less diversity however, and it is prone to being stuck in a loop. There seem to be a set of char sequences which are highly probable and the prediction often converges to one of these sequences.

Top K or Multinomial approach:

This approach models the occurrence of an outcome out of n categories each with defined probability of occurrences. In order to inject more diversity into the outputs, instead of selecting the most probable character we model the Neural network as a probability event and observe which token is output in a single play. Observing the results however, this method isnt great at producing complete words. The structure inherent within words is broken more easily. One alternative would be to use the multinational distribution sparingly, in combination with the greedy best fit approach

Combination

We saw that using a greedy approach always can lead to a looping or repeated pattern, while the multinomial of top-k approach produces more unintelligible words at the char level. I tried to get a balance of both worlds, by using a greedy approach to generate characters, until a space (gap in words) is encountered. At this point I use a multinomial generator, so that there is more diversity in word choice. This results in text that is more diverse, but with more intelligible words as a whole.

Beam search

Beam search is a method of trying to get more ‘optimal results’ and we look at predicting sequences longer than one character in an iteration. Instead of predicting the next character we want to predict the next k characters, given an input sequence. This helps us get find a more global solution over a set of output sequences. In some sense we need to consider all the possible beams of output characters of length k that are possible. and need to have some metric of comparison.

A couple of things help us out while calculating probabilities: First, using Bayes rule, we can model the probability of a getting a particular output sequence as a product of individual conditional probability sequences. For eg: Score of predicting ‘and’ given input sequence ‘The sun ‘ will be Score = P(‘a’| input = ‘The sun ‘) *P(‘n’| input = ‘he sun a’)* P(‘d’ | input = ‘e sun an’). Second, the output of the model soft-max is often in log format, and this makes out implementations easier, we can add the log values instead of multiplying them.

Shown below is most of the code needed to implement a beam search filter. We define a data-structure letters to store all possible sequences of a give length (say 3 ) and a placeholder for its score. The charscore function calculates the probability of predicting a given output for a given input sequence. The beam_search function iterates through the possible sequences and calculates the score for each sequence. beam_text is the function which iteratively applies the beam_search function to generate a sequence of a give length.

This naive implementation of beam search takes quite long (for e.g. 8 min to produce 15 samples). Below is another example, we are generating 15 beams, or 45 char tokens.

I believe this speed can be reduced a great deal by running inference fully on CPU, and by optimizing beam_search better.

Future experiments

Here are a few more variants that I’ve been trying, with no great success(yet). These are mainly driven by the slow running time of beam search.

Greedy beam search: This is similar to beam search, but instead of searching over all sequences, we search through the ‘most probable’ sequences. For e.g. for a beam width of 3, we might look at the top-10 most probable choices in each of the 3 stages. So we would search through 10*10*10 beams every iteration. This certainly cuts down the delay problem by a great deal, but the result quickly falls into a cycle, with repeating patters. For e.g, two beam patterns ‘.. ‘ and ‘…’ repeating endlessly.

In order to add more diversity, I multiplied individual probabilities instead of adding them, while calculating the beam score. The results were more diverse, but not necessarily better.

[i, j, k], i_score * j_score * k_score]

References:

https://github.com/coderbee/jokes/

Short Jokes dataset

More Related Stories