Learn how to summarize text in this article by Rajdeep Dua, who currently leads the developer relations team at Salesforce India, and Manpreet Singh Ghotra, who is currently working at Salesforce developing a machine learning platform and APIs.
Text summarization is a method in natural language processing (NLP) for generating a short and precise summary of a reference document. Producing a summary of a large document manually is a very difficult task, and summarizing text with machine learning techniques is still an active research topic. Before discussing text summarization and how it is done, here is a definition of a summary.
A summary is a text generated from one or more source texts that conveys the relevant information from the original in a shorter form. The goal of automatic text summarization is to transform the source text into a shorter version while preserving its meaning.
Lately, various approaches have been developed for automated text summarization using NLP techniques, and they have been implemented widely in various domains. Some examples include search engines creating summaries for use in previews of documents and news websites producing consolidated descriptions of news topics, usually as headlines, to help users browse.
To summarize text effectively, deep learning models need to be able to understand documents and discern and distill the important information. These methods are highly challenging and complex, particularly as the length of a document increases.
This article will show you how to work on the problem of text summarization by creating relevant summaries for product reviews about fine foods sold on Amazon, the world's largest e-commerce platform. Each review includes product and user information, a rating, and a plain-text review body, and the dataset also contains reviews from other Amazon categories. You will develop a basic character-level sequence-to-sequence (seq2seq) model by defining an encoder-decoder recurrent neural network (RNN) architecture.
The dataset used in this article can be found at https://www.kaggle.com/snap/amazon-fine-food-reviews/. Your dataset will include the following:
· 568,454 reviews
· 256,059 users
· 74,258 products
You’ll develop a modeling pipeline and encoder-decoder architecture that tries to create relevant summaries for a given set of reviews. The modeling pipelines use RNN models written using the Keras functional API. The pipelines also use various data manipulation libraries.
The encoder-decoder architecture is used as a way of building RNNs for sequence prediction. It involves two major components: an encoder and a decoder. The encoder reads the complete input sequence and encodes it into an internal representation, usually a fixed-length vector, described as the context vector. The decoder, on the other hand, reads the encoded input sequence from the encoder and generates the output sequence. Various types of encoders can be used; most commonly, RNNs such as LSTMs, often bidirectional, are used.
It is crucial that you serve the right data as input to the neural architecture for training and validation. Make sure that data is on a useful scale and format, and that meaningful features are included. This will lead to better and more consistent results.
Employ the following workflow for data preprocessing:
1. Load the dataset using pandas
2. Split the dataset into input and output variables for machine learning
3. Apply a preprocessing transform to the input variables
4. Summarize the data to show the change
Now get started step by step:
1. Get started by importing the important packages and your dataset. Use the pandas library to load the data and review the shape of your dataset; it includes 10 features and 568,454 data points:
import pandas as pd
import re
from nltk.corpus import stopwords
from pickle import dump, load
reviews = pd.read_csv("/deeplearning-keras/ch09/summarization/Reviews.csv")
print(reviews.shape)
print(reviews.head())
print(reviews.isnull().sum())
The output will be as follows:
(568454, 10)
Id 0
ProductId 0
UserId 0
ProfileName 16
HelpfulnessNumerator 0
HelpfulnessDenominator 0
Score 0
Time 0
Summary 27
Text 0
2. Remove null values and unneeded features, as shown in the following snippet:
reviews = reviews.dropna()
reviews = reviews.drop(['Id', 'ProductId', 'UserId', 'ProfileName',
                        'HelpfulnessNumerator', 'HelpfulnessDenominator',
                        'Score', 'Time'], axis=1)
reviews = reviews.reset_index(drop=True)
print(reviews.head())

# Inspect a few summary/text pairs
for i in range(5):
    print("Review #", i + 1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()
The output will be as follows:
Summary Text
0 Good Quality Dog Food I have bought several of the Vitality canned d...
1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
2 "Delight," says it all This is a confection that has been around a fe...
3 Cough Medicine If you are looking for the secret ingredient i...
Review # 1
Not as Advertised - Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Review # 2
"Delight" says it all - This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case, Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
Review # 3
Cough Medicine - If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
By definition, a contraction is the combination of two words into a reduced form, with the omission of some internal letters and the use of an apostrophe. You can get the list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python.
3. Replace contractions with their longer forms, as shown here:
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    # ... the remaining contractions from the list follow the same pattern
}
4. Clean the text documents by replacing contractions and removing stop words:
def clean_text(text, remove_stopwords=True):
    '''Expand contractions, remove unwanted characters, and optionally remove stop words.'''
    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms
    text = text.split()
    new_text = []
    for word in text:
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)

    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
        text = " ".join(text)

    return text
5. This function removes unwanted characters and, optionally, stop words, after replacing the contractions as shown previously. You can get the list of stop words from the Natural Language Toolkit (NLTK), which also helps with splitting paragraphs into sentences, splitting up words, and recognizing parts of speech. Download the stop-word list using the following commands:
import nltk
nltk.download('stopwords')
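If you want to see what will be filtered out, you can print a few of NLTK's English stop words (shown here purely as an illustration; the exact list depends on your NLTK version):
# Peek at a few of the English stop words NLTK provides
print(stopwords.words("english")[:10])
# e.g. ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', ...]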
6. Clean the summaries and texts as shown in the following snippet:
# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")
7. Finally, save all the reviews into a pickle file. pickle serializes objects so they can be saved to a file and loaded in a program again later on:
stories = list()
for i, text in enumerate(clean_texts):
    stories.append({'story': text, 'highlights': clean_summaries[i]})
# save to file
dump(stories, open('/deeplearning-keras/ch09/summarization/review_dataset.pkl', 'wb'))
Now develop a basic character-level seq2seq model for text summarization. A word-level model is more common in the domain of text processing, but for this article you will use a character-level model. As mentioned earlier, the encoder-decoder architecture is a way of creating RNNs for sequence prediction. The encoder reads the entire input sequence and encodes it into an internal representation, usually a fixed-length vector named the context vector. The decoder, on the other hand, reads the encoded input sequence from the encoder and produces the output sequence.
The encoder-decoder architecture consists of two primary models: one reads the input sequence and encodes it to a fixed-length vector, and the second decodes the fixed-length vector and outputs the predicted sequence. This architecture is designed for seq2seq problems.
1. First, define the hyperparameters: the batch size, the number of training epochs, the dimensionality of the LSTM's latent space, and the number of samples to train on:
batch_size = 64
epochs = 110
latent_dim = 256
num_samples = 10000
2. Next, load the review dataset from the pickle file:
stories = load(open('/deeplearning-keras/ch09/summarization/review_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
print(type(stories))
The output will be as follows:
Loaded Stories 568411
<class 'list'>
3. Then, vectorize the data:
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
for story in stories:
    input_text = story['story']
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as the "end sequence" character.
    target_text = '\t' + story['highlights'] + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
The output will be as follows:
Number of samples: 568411
Number of unique input tokens: 84
Number of unique output tokens: 48
Max sequence length for inputs: 15074
Max sequence length for outputs: 5
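The training step further below refers to encoder_input_data, decoder_input_data, and decoder_target_data, which are not shown in the listings here. The following is a minimal sketch of how these one-hot arrays could be built, following the standard Keras character-level seq2seq recipe and restricted to num_samples reviews; the variable names reuse those defined above, but this block is an assumption rather than part of the original listing:
import numpy as np

# Character-to-index lookups built from the sorted character sets
input_token_index = dict((char, i) for i, char in enumerate(input_characters))
target_token_index = dict((char, i) for i, char in enumerate(target_characters))

# Restrict to a subset of reviews so the dense one-hot arrays stay manageable
n = min(num_samples, len(input_texts))
encoder_input_data = np.zeros((n, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros((n, max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros((n, max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts[:n], target_texts[:n])):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
With a maximum input length of 15,074 characters, even 10,000 samples produce very large arrays, so in practice you would truncate long reviews or feed batches from a generator.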
4. Now, create a generic function to define an encoder-decoder RNN:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

def define_models(n_input, n_output, n_units):
    # define training encoder
    encoder_inputs = Input(shape=(None, n_input))
    encoder = LSTM(n_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]
    # define training decoder
    decoder_inputs = Input(shape=(None, n_output))
    decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    # define inference encoder
    encoder_model = Model(encoder_inputs, encoder_states)
    # define inference decoder
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
    # return all models
    return model, encoder_model, decoder_model
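Assuming the reconstructed signature above, the training and inference models can then be built from the character vocabularies and latent dimension defined earlier; this call is illustrative rather than part of the original listing:
# Build the training model plus the separate inference encoder and decoder
model, encoder_model, decoder_model = define_models(num_encoder_tokens, num_decoder_tokens, latent_dim)
model.summary()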
5. To run the training, use the rmsprop optimizer and categorical_crossentropy as the loss function:
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
batch_size=batch_size,
epochs=epochs,
validation_split=0.2)
# Save model
model.save('/deeplearning-keras/ch09/summarization/model2.h5')
The output will be as follows:
64/800 [=>............................] - ETA: 22:05 - loss: 2.1460
128/800 [===>..........................] - ETA: 18:51 - loss: 2.1234
192/800 [======>.......................] - ETA: 16:36 - loss: 2.0878
256/800 [========>.....................] - ETA: 14:38 - loss: 2.1215
320/800 [===========>..................] - ETA: 12:47 - loss: 1.9832
384/800 [=============>................] - ETA: 11:01 - loss: 1.8665
448/800 [===============>..............] - ETA: 9:17 - loss: 1.7547
512/800 [==================>...........] - ETA: 7:35 - loss: 1.6619
576/800 [====================>.........] - ETA: 5:53 - loss: 1.5820
512/800 [==================>...........] - ETA: 7:19 - loss: 0.7519
576/800 [====================>.........] - ETA: 5:42 - loss: 0.7493
640/800 [=======================>......] - ETA: 4:06 - loss: 0.7528
704/800 [=========================>....] - ETA: 2:28 - loss: 0.7553
768/800 [===========================>..] - ETA: 50s - loss: 0.7554
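Since the trained model is saved to disk, it can be reloaded later with Keras's load_model instead of retraining, for example:
from keras.models import load_model

# Reload the seq2seq model saved during training
model = load_model('/deeplearning-keras/ch09/summarization/model2.h5')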
6. For inference, use the following method:
from numpy import array

# generate target given source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    # encode the source sequence into the initial decoder state
    state = infenc.predict(source)
    # start-of-sequence input: an all-zeros one-hot vector
    target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
    # collect predictions
    output = list()
    for t in range(n_steps):
        # predict next char
        yhat, h, c = infdec.predict([target_seq] + state)
        # store prediction
        output.append(yhat[0, 0, :])
        # update state
        state = [h, c]
        # update target sequence
        target_seq = yhat
    return array(output)
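The following is a sketch of how predict_sequence might be called on one encoded review and its one-hot predictions mapped back to characters; the reverse_target_char_index lookup and the reuse of encoder_input_data from the earlier sketch are assumptions, not part of the original listing:
import numpy as np

# Map character indices back to characters for decoding
reverse_target_char_index = dict((i, char) for i, char in enumerate(target_characters))

# Predict a summary for the first encoded review and decode it
prediction = predict_sequence(encoder_model, decoder_model,
                              encoder_input_data[0:1], max_decoder_seq_length,
                              num_decoder_tokens)
decoded = ''.join(reverse_target_char_index[int(np.argmax(step))] for step in prediction)
# In practice you would stop at the first "\n" end-of-sequence character
print(decoded.split('\n')[0])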
The output will be as follows:
Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
Summary(1): great coffee
Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!
Summary(2): omg gross gross
Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon to know quaker flavor packets
Summary(3): love it
If you found this article interesting, you can explore Keras Deep Learning Cookbook to leverage the power of deep learning and Keras to develop smarter and more efficient data models. Keras Deep Learning Cookbook shows you how to tackle different problems encountered while training efficient deep learning models, with the help of the popular Keras library.