TL;DR: After reading this article, you will be able to build a model that generates 5-star Yelp reviews like the ones below.
Samples of generated review text (unmodified)
<SOR>I had the steak, mussels with a side of chicken parmesan. All were very good. We will be back.<EOR>
<SOR>The food, service, atmosphere, and service are excellent. I would recommend it to all my friends<EOR>
<SOR>Good atmosphere, amazing food and great service.Service is also pretty good. Give them a try!<EOR>
I will show you how to:
- Acquire and prepare the training data.
- Build the character-level language model.
- Train the model, with a few practical tips.
- Generate random reviews.
Training the model can easily take a couple of days, even on a GPU. Luckily, pre-trained model weights are available, so we can jump directly to the fun part and generate reviews.
Getting the data ready
The Yelp Dataset is freely available in JSON format.
After downloading and extracting, you will find the 2 files we need in the dataset folder: review.json and business.json.
Those two files are quite large, especially the review.json file (3.7 GB).
Each line of the review.json file is one review, stored as a JSON string. The two files do not have the opening and closing square brackets “[ ]”, so the content of each file as a whole is not a valid JSON string. On top of that, the whole review.json file may not fit in memory. So let’s first convert both files to CSV format, line by line, with our helper script.
python json_converter.py ./dataset/review.json
python json_converter.py ./dataset/business.json
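The helper script itself is not reproduced here. A minimal sketch of the kind of line-by-line conversion it performs might look like the following. The function name `json_lines_to_csv` and the choice to take the column names from the first record are assumptions for illustration, not the actual contents of json_converter.py.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a file with one JSON object per line into a CSV file.

    Only one line is held in memory at a time, so the size of the
    input file does not matter.
    """
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = None
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if writer is None:
                # Assumption: every record shares the keys of the first one.
                writer = csv.DictWriter(dst, fieldnames=list(record.keys()),
                                        extrasaction="ignore")
                writer.writeheader()
            writer.writerow(record)
```

Because each line is parsed on its own, the missing “[ ]” brackets are not a problem.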
After that, you will find the two converted CSV files in the dataset folder.
Both are valid CSV files that we can open with the pandas library.
Here is what we are going to do: extract only the 5-star review texts for businesses that have ‘Restaurant’ in their categories.
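One way to express that filter in pandas is sketched below. The column names (business_id, stars, text, categories) follow the field names of the Yelp JSON files, but the function name and the idea of passing two already-loaded DataFrames are assumptions made for illustration.

```python
import pandas as pd


def five_star_restaurant_reviews(reviews, businesses):
    """Return the texts of 5-star reviews written for restaurants."""
    # Keep businesses whose categories mention "Restaurant".
    # astype(str) turns missing categories into "nan", which never matches.
    is_restaurant = businesses["categories"].astype(str).str.contains("Restaurant")
    restaurant_ids = set(businesses.loc[is_restaurant, "business_id"])
    # Keep 5-star reviews that belong to one of those businesses.
    mask = (reviews["stars"] == 5) & reviews["business_id"].isin(restaurant_ids)
    return reviews.loc[mask, "text"]
```

Filtering the business table first keeps the set of IDs small, so the much larger review table only needs a single pass.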
Next, let’s remove the newline characters in the reviews and drop any duplicated reviews.
To show the model where a review starts and ends, we need to add special markers to the review texts.
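These clean-up steps can be sketched in plain Python as follows. The function name `prepare_reviews` is hypothetical, and note that this version also collapses repeated spaces, which goes slightly beyond removing newlines.

```python
def prepare_reviews(texts):
    """Strip line breaks, drop duplicates, and wrap each review in markers."""
    seen = set()
    prepared = []
    for text in texts:
        # Replace every run of whitespace (including newlines) with one space.
        cleaned = " ".join(text.split())
        if not cleaned or cleaned in seen:
            continue  # skip empty and duplicated reviews
        seen.add(cleaned)
        prepared.append("<SOR>" + cleaned + "<EOR>")
    return prepared
```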
So one line of the final prepared reviews will look like this.
"<SOR>Hummus is amazing and fresh! Loved the falafels. I will definitely be back. Great owner, friendly staff<EOR>"
Build the model
The model we are building here is a character-level language model, meaning the smallest distinguishable symbol is a character. You may also come across word-level models, where the input is a sequence of word tokens.
The character-level language model has some pros and cons.
Pros:
- No need to worry about unknown vocabulary.
- Able to learn a large vocabulary.
Cons:
- Ends up with very long sequences, so it is not as good as a word-level language model at capturing long-range dependencies, that is, how earlier parts of a sentence affect later parts.
- More computationally expensive to train.
The model is quite similar to the official lstm_text_generation.py demo code, except that we stack RNN cells, which allows more information to be stored in the hidden states between the input and output layers. This generates more realistic Yelp reviews.
Before showing the code for the model, let’s look a little deeper at how stacking RNNs works.
You may have seen this in a standard neural network (that is, Dense layers in Keras).
The first layer takes the input x to compute the activation value a[1], which is fed to the next layer to compute the next activation value a[2].
Stacking RNNs is a bit like the standard neural network, plus “unrolling in time”.
In the notation a[l]<t>, [l] denotes layer l and <t> denotes timestep t.
Let’s take a look at how an activation value is computed.
To compute a[2]<3>, there are two inputs, a[2]<2> and a[1]<3>:
a[2]<3> = g(Wa[2] [a[2]<2>, a[1]<3>] + ba[2])
g is the activation function, and Wa[2] and ba[2] are the layer 2 parameters.
As we can see, to stack RNNs, the previous RNN needs to return the activations a<t> for all timesteps to the subsequent RNN.
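To make this concrete, here is a small NumPy sketch of two stacked simple RNN layers. It uses a plain tanh RNN rather than an LSTM, random weights, and made-up dimensions, all of which are assumptions chosen to keep the example short. The weight matrix is split into a recurrent part and an input part, which is equivalent to applying one matrix to the concatenated inputs.

```python
import numpy as np


def rnn_layer(inputs, w_a, w_x, b):
    """Run a simple tanh RNN over all timesteps.

    inputs: array of shape (T, input_dim)
    Returns the activations for every timestep, shape (T, units).
    """
    a_prev = np.zeros(b.shape[0])  # a<0>
    outputs = []
    for x_t in inputs:
        # a<t> = g(W_a a<t-1> + W_x x<t> + b), with g = tanh
        a_prev = np.tanh(w_a @ a_prev + w_x @ x_t + b)
        outputs.append(a_prev)
    return np.stack(outputs)


rng = np.random.default_rng(0)
T, input_dim, units = 5, 4, 3
x = rng.normal(size=(T, input_dim))

# Layer 1 reads the raw inputs x<1>..x<T>.
a1 = rnn_layer(x, rng.normal(size=(units, units)),
               rng.normal(size=(units, input_dim)), np.zeros(units))
# Layer 2 reads ALL of layer 1's activations a[1]<1>..a[1]<T>.
a2 = rnn_layer(a1, rng.normal(size=(units, units)),
               rng.normal(size=(units, units)), np.zeros(units))
```

If layer 1 returned only its last activation, layer 2 would have a single timestep to work with instead of the whole sequence.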
By default, an RNN layer such as LSTM in Keras only returns the activation value of the last timestep, a<T>. To return the activation values of all timesteps, we set the return_sequences parameter to True.
So here is how we build the model in Keras. Each input sample is a one-hot representation of 60 characters, and there are 95 possible characters in total.
Each output is a list of 95 predicted probabilities, one for each character.
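The original listing is not included in this text. A minimal sketch that is consistent with the description (60 one-hot encoded characters in, 95 probabilities out, two stacked LSTM layers) could look like this. The number of LSTM units, the optimizer, and the function name `build_model` are assumptions, not the settings used for the pre-trained weights.

```python
import keras
from keras import layers

SEQ_LEN = 60   # characters per input sample
N_CHARS = 95   # size of the character set


def build_model(units=128):
    """Two stacked LSTM layers followed by a softmax over the characters."""
    model = keras.Sequential([
        keras.Input(shape=(SEQ_LEN, N_CHARS)),
        # return_sequences=True passes every timestep to the next LSTM.
        layers.LSTM(units, return_sequences=True),
        # The last LSTM returns only the final timestep.
        layers.LSTM(units),
        layers.Dense(N_CHARS, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```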
And here is the graphical model structure to help you visualize it.
Training the model
The idea behind training the model is simple: we train it with input/output pairs. Each input is 60 characters, and the corresponding output is the character that immediately follows.
In the data preparation step, we created a list of clean 5-star review texts, 1,214,016 lines in total. To simplify training, we only train on reviews of 250 characters or fewer, which leaves 418,955 lines of reviews.
Then we shuffle the order of the reviews so we don’t train on 100 reviews for the same restaurant in a row.
We read all reviews as one long text string. Then we create a Python dictionary (i.e., a hash table) to map each character to an index from 0 to 94 (95 unique characters in total).
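A sketch of building that mapping is shown below. The function name is hypothetical, and sorting the characters is just one way to make the indices reproducible between runs.

```python
def build_char_index(text):
    """Map each distinct character to an integer index, and back again."""
    chars = sorted(set(text))
    char_to_index = {ch: i for i, ch in enumerate(chars)}
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return char_to_index, index_to_char
```

The reverse mapping is needed later, when predicted indices are turned back into characters.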
The text corpus has a total of 72,662,807 characters, which is hard to process as a whole. So let’s break it down into chunks of 90k characters each.
For each chunk of the corpus, we generate pairs of inputs and outputs by shifting a pointer from the beginning to the end of the chunk, one character at a time if step is set to 1.
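The chunking and the sliding window can be sketched with NumPy as follows. The function names are hypothetical, and `char_to_index` is assumed to be the character dictionary described earlier.

```python
import numpy as np


def iter_chunks(text, chunk_size=90000):
    """Yield consecutive slices of the corpus, chunk_size characters each."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def make_pairs(chunk, char_to_index, seq_len=60, step=1):
    """One-hot encode (input window, next character) pairs for one chunk."""
    starts = range(0, len(chunk) - seq_len, step)
    n_chars = len(char_to_index)
    x = np.zeros((len(starts), seq_len, n_chars), dtype=bool)
    y = np.zeros((len(starts), n_chars), dtype=bool)
    for i, start in enumerate(starts):
        for t, ch in enumerate(chunk[start:start + seq_len]):
            x[i, t, char_to_index[ch]] = True
        # The target is the character right after the window.
        y[i, char_to_index[chunk[start + seq_len]]] = True
    return x, y
```

A larger step produces fewer, less overlapping samples per chunk, which trades some training data for speed.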
Training on one chunk for one epoch takes 219 seconds on a GPU (GTX 1070), so one epoch over the full corpus takes about 2 days.
72662807 / 90000 * 219 / 60 / 60 / 24 = 2.0 days
Two Keras callbacks come in handy: ModelCheckpoint and ReduceLROnPlateau.
ModelCheckpoint saves the weights every time the model improves.
ReduceLROnPlateau automatically reduces the learning rate when the loss metric stops decreasing. Its main benefit is that we don’t need to tune the learning rate manually. Its main weakness is that the learning rate only ever decreases.
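To illustrate the behaviour, here is a simplified plain-Python imitation of the plateau rule. It is not the Keras implementation (it ignores options such as cooldown and the minimum improvement threshold), and the function name and the numbers in the example are made up.

```python
def reduce_on_plateau(losses, lr, factor, patience, min_lr=0.0):
    """Return the learning rate used in each epoch, given the observed losses.

    The rate is multiplied by `factor` once the loss has failed to
    improve for `patience` epochs in a row, and never goes below min_lr.
    """
    best = float("inf")
    wait = 0
    schedule = []
    for loss in losses:
        schedule.append(lr)
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule


schedule = reduce_on_plateau([1.0, 0.8, 0.8, 0.9, 0.7],
                             lr=0.1, factor=0.5, patience=2, min_lr=0.01)
print(schedule)  # [0.1, 0.1, 0.1, 0.1, 0.05]
```

Note that the rate stays at 0.05 even though the loss improves again in the last epoch, which is the weakness mentioned above.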
The code to train the model for 20 epochs looks like this.
As you might guess, at about 2 days per epoch this would take roughly 40 days. But training for about 2 hours already produced some promising results in my case. Feel free to give it a try.
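The time estimate follows directly from the numbers quoted above, as this short calculation shows. It assumes every chunk takes the same 219 seconds.

```python
CORPUS_CHARS = 72_662_807   # characters in the corpus
CHUNK_CHARS = 90_000        # characters per chunk
SECONDS_PER_CHUNK = 219     # measured on a GTX 1070
EPOCHS = 20

chunks_per_epoch = CORPUS_CHARS / CHUNK_CHARS
days_per_epoch = chunks_per_epoch * SECONDS_PER_CHUNK / 60 / 60 / 24
total_days = days_per_epoch * EPOCHS

print(f"{days_per_epoch:.1f} days per epoch, {total_days:.0f} days for {EPOCHS} epochs")
# 2.0 days per epoch, 41 days for 20 epochs
```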
Generate 5-star reviews
Whether you jumped right to this section or read through the previous ones, here is the fun part!
With the pre-trained model weights, or weights you trained yourself, we can generate some interesting Yelp reviews.
Here is the idea: we “seed” the model with an initial 60 characters and ask it to predict the very next character.
The “sampling index” step adds some variety to the final result by introducing randomness into how the next character is picked from the predicted probabilities.
If the temperature is very small, it will always pick the index with the highest predicted probability.
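A common way to implement this kind of temperature sampling is sketched below with NumPy. The function name `sample_index` and the small constant added before taking the logarithm are assumptions for illustration.

```python
import numpy as np


def sample_index(preds, temperature=1.0, rng=None):
    """Draw a character index from the predicted probabilities.

    Lower temperatures sharpen the distribution toward the most likely
    index; higher temperatures flatten it and add variety.
    """
    if rng is None:
        rng = np.random.default_rng()
    preds = np.asarray(preds, dtype=np.float64)
    # Rescale the log-probabilities by the temperature, then renormalise.
    logits = np.log(preds + 1e-12) / temperature
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With a temperature of 1.0 the index is drawn from the model's predicted distribution unchanged.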
The following code generates 300 characters.
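The original listing is not included in this text, so here is a hedged sketch of such a generation loop. To keep it independent of any particular model object, it takes a `predict_fn` callable, which with Keras could be the trained model's predict method. The function name, its parameters, and the default temperature are assumptions.

```python
import numpy as np


def generate(predict_fn, seed, char_to_index, length=300,
             temperature=0.5, seq_len=60, rng=None):
    """Extend `seed` by `length` characters, one sampled character at a time.

    predict_fn takes a one-hot array of shape (1, seq_len, n_chars) and
    returns next-character probabilities of shape (1, n_chars).
    The seed is assumed to contain at least seq_len characters.
    """
    if rng is None:
        rng = np.random.default_rng()
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    n_chars = len(char_to_index)
    window = seed[-seq_len:]
    generated = ""
    for _ in range(length):
        # One-hot encode the current window of characters.
        x = np.zeros((1, len(window), n_chars))
        for t, ch in enumerate(window):
            x[0, t, char_to_index[ch]] = 1.0
        preds = np.asarray(predict_fn(x)[0], dtype=np.float64)
        # Temperature sampling of the next character index.
        logits = np.log(preds + 1e-12) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_char = index_to_char[int(rng.choice(n_chars, p=probs))]
        generated += next_char
        # Slide the window forward by one character.
        window = (window + next_char)[-seq_len:]
    return generated
```

Each predicted character is appended to the window, so the model keeps conditioning on its own most recent output.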
Summary and Further reading
In this post, you learned how to build and train a character-level text generation model from beginning to end. The source code is available on my GitHub repo, along with the pre-trained model to play with.
The model shown here is trained in a many-to-one fashion. There is also an alternative implementation in a many-to-many fashion: for an input sequence of 7 characters, “The cak”, the expected output is “he cake”. You can check it out here: char_rnn_karpathy_keras.
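The many-to-many training pair is simply the input window shifted by one character, as this small sketch shows. The function name is hypothetical and is not taken from char_rnn_karpathy_keras.

```python
def many_to_many_pair(text, start, seq_len):
    """Return (input, target) where the target is the input shifted by one."""
    x = text[start:start + seq_len]
    y = text[start + 1:start + seq_len + 1]
    return x, y


print(many_to_many_pair("The cake", 0, 7))  # ('The cak', 'he cake')
```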
Originally published at www.dlology.com, where you can find more practical deep learning experiences.