Train a character-level language model on a corpus of jokes. I decided to experiment with approaches to this problem, which I found on OpenAI’s blog. You can have a look at the code . This is written in , and is heavily inspired by ’s fantastic lesson on implementing RNN’s from scratch. Request for Research here Pytorch Fast.ai I started off using the provided by . The data was converted to lowercase and for an initial run, I selected the top rated jokes, with a word length of less than 200. Here’s an example of all the tokens encountered: Data preparation dataset OpenAI This particular dataset has explicit words/content, so those come up in the output predictions of the model. Another interesting problem to work on would be to filter out inappropriate words from the output of the model — either by checking the generated output for appearance of a known set of explicit words, or better still — by building a clean/not clean sentiment classifier! I also ran the results using a much cleaner I found on github. *Explicit words ahead!* dataset I tried several models for LSTM’s primarily varying the batch size, hidden layers. Training the char level model got progressively more difficult as the number of back-propagation char tokens ( ) increased. Lets focus mainly on the Text Generation part for now. The model outputs one character at a time and our goal is to generate a bunch of (seemingly intelligible and possibly funny) text. There are many approaches that could be taken: bptt Greedy approach: This is the simplest method in which we recursively select the most probable next sequence character. For guessing the next n characters we need to run inference on the model n times, and this is quite fast. The output in this case has less diversity however, and it is prone to being stuck in a loop. There seem to be a set of char sequences which are highly probable and the prediction often converges to one of these sequences. Top K or Multinomial approach: This approach models the occurrence of an outcome out of n categories each with defined probability of occurrences. In order to inject more diversity into the outputs, instead of selecting the most probable character we model the Neural network as a probability event and observe which token is output in a single play. Observing the results however, this method isnt great at producing complete words. The structure inherent within words is broken more easily. One alternative would be to use the multinational distribution sparingly, in combination with the greedy best fit approach Combination We saw that using a greedy approach always can lead to a looping or repeated pattern, while the multinomial of top-k approach produces more unintelligible words at the char level. I tried to get a balance of both worlds, by using a greedy approach to generate characters, until a space (gap in words) is encountered. At this point I use a multinomial generator, so that there is more diversity in word choice. This results in text that is more diverse, but with more intelligible words as a whole. Beam search Beam search is a method of trying to get more ‘optimal results’ and we look at predicting sequences longer than one character in an iteration. Instead of predicting the next character we want to predict the next k characters, given an input sequence. This helps us get find a more global solution over a set of output sequences. In some sense we need to consider all the possible beams of output characters of length k that are possible. and need to have some metric of comparison. A couple of things help us out while calculating probabilities: First, using Bayes rule, we can model the probability of a getting a particular output sequence as a product of individual conditional probability sequences. For eg: Score of predicting ‘and’ given input sequence ‘The sun ‘ will be Second, the output of the model soft-max is often in log format, and this makes out implementations easier, we can add the log values instead of multiplying them. Score = P(‘a’| input = ‘The sun ‘) *P(‘n’| input = ‘he sun a’)* P(‘d’ | input = ‘e sun an’). Shown below is most of the code needed to implement a beam search filter. We define a data-structure to store all possible sequences of a give length (say 3 ) and a placeholder for its score. The function calculates the probability of predicting a given output for a given input sequence. The function iterates through the possible sequences and calculates the score for each sequence. is the function which iteratively applies the beam_search function to generate a sequence of a give length. letters charscore beam_search beam_text This naive implementation of beam search takes quite long (for e.g. 8 min to produce 15 samples). Below is another example, we are generating 15 beams, or 45 char tokens. I believe this speed can be reduced a great deal by running inference fully on CPU, and by optimizing better. beam_search Future experiments Here are a few more variants that I’ve been trying, with no great success(yet). These are mainly driven by the slow running time of beam search. This is similar to beam search, but instead of searching over all sequences, we search through the ‘most probable’ sequences. For e.g. for a beam width of 3, we might look at the top-10 most probable choices in each of the 3 stages. So we would search through 10*10*10 beams every iteration. This certainly cuts down the delay problem by a great deal, but the result quickly falls into a cycle, with repeating patters. For e.g, two beam patterns ‘.. ‘ and ‘…’ repeating endlessly. Greedy beam search: In order to add more diversity, I multiplied individual probabilities instead of adding them, while calculating the beam score. The results were more diverse, but not necessarily better. [i, j, k], i_score * j_score * k_score] References: _fastai is the first deep learning library to provide a single consistent interface to all the most commonly used deep…_www.fast.ai fast.ai · Making neural nets uncool again _Natural language processing tasks, such as caption generation and machine translation, involve generating sequences of…_machinelearningmastery.com How to Implement a Beam Search Decoder for Natural Language Processing https://github.com/coderbee/jokes/ Short Jokes dataset

BUNCH

Too Long; Didn't Read

In your car, at home, or at work — Bosch technology shapes many areas of life.

Text Generation for Char LSTM models

Too Long; Didn't Read

Companies Mentioned

Nikhil B

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

Text Generation for Char LSTM models

Too Long; Didn't Read

Companies Mentioned

Nikhil B

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES