Learn how to summarize text in this article by Rajdeep Dua, who currently leads the developer relations team at Salesforce India, and Manpreet Singh Ghotra, who is currently working at Salesforce developing a machine learning platform/APIs.

Text summarization is a method in natural language processing (NLP) for generating a short and precise summary of a reference document. Producing a summary of a large document manually is a very difficult task, and summarizing text with machine learning techniques is still an active research topic. Before discussing text summarization and how it is done, here is a definition of a summary: a summary is text generated from one or more source documents that conveys the relevant information from the original in a shorter form. The goal of automatic text summarization is to transform the source text into a shorter version using its semantics.

Lately, various approaches have been developed for automated text summarization using NLP techniques, and they have been widely adopted in various domains. Examples include search engines creating summaries for document previews, and news websites producing consolidated descriptions of news topics, usually as headlines, to help users browse. To summarize text effectively, deep learning models need to be able to understand documents and discern and distill the important information. These tasks are highly challenging and complex, particularly as the length of a document increases.

Text summarization for reviews

This article will show you how to work on the problem of text summarization to create relevant summaries for product reviews about fine food sold on the world's largest e-commerce platform, Amazon. Reviews include product and user information, ratings, and a plain-text review; the dataset also includes reviews from all other Amazon categories. You will develop a basic character-level sequence-to-sequence (seq2seq) model by defining an encoder-decoder recurrent neural network (RNN) architecture.

The dataset used in this article can be found at https://www.kaggle.com/snap/amazon-fine-food-reviews/. It includes the following:

· 568,454 reviews
· 256,059 users
· 74,258 products

How to do it…

You'll develop a modeling pipeline and an encoder-decoder architecture that tries to create relevant summaries for a given set of reviews. The modeling pipeline uses RNN models written with the Keras functional API, along with various data manipulation libraries. The encoder-decoder architecture is a way of building RNNs for sequence prediction. It involves two major components: an encoder and a decoder. The encoder reads the complete input sequence and encodes it into an internal representation, usually a fixed-length vector, described as the context vector. The decoder, in turn, reads the encoded input sequence from the encoder and generates the output sequence. Various types of encoders can be used; most commonly, bidirectional RNNs, such as LSTMs, are used.

Data processing

It is crucial that you serve the right data as input to the neural architecture for training and validation. Make sure that the data is on a useful scale and in a suitable format, and that meaningful features are included; this will lead to better and more consistent results. Employ the following workflow for data preprocessing:

1. Load the dataset using pandas
2. Split the dataset into input and output variables for machine learning
3. Apply a preprocessing transform to the input variables
4. Summarize the data to show the change
Now get started step by step:

1. Start by importing the required packages and your dataset. Use the pandas library to load the data and review the shape of your dataset; it includes 10 features and 568,454 data points:

```python
import pandas as pd
import re
from nltk.corpus import stopwords
from pickle import dump, load

reviews = pd.read_csv("/deeplearning-keras/ch09/summarization/Reviews.csv")
print(reviews.shape)
print(reviews.head())
print(reviews.isnull().sum())
```

The output will be as follows:

```
(568454, 10)
Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
```

2. Remove null values and unneeded features, as shown in the following snippet:

```python
reviews = reviews.dropna()
reviews = reviews.drop(['Id', 'ProductId', 'UserId', 'ProfileName',
                        'HelpfulnessNumerator', 'HelpfulnessDenominator',
                        'Score', 'Time'], axis=1)
reviews = reviews.reset_index(drop=True)
print(reviews.head())

for i in range(5):
    print("Review #", i + 1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()
```

The output will be as follows:

```
                   Summary                                               Text
0    Good Quality Dog Food  I have bought several of the Vitality canned d...
1        Not as Advertised  Product arrived labeled as Jumbo Salted Peanut...
2   "Delight," says it all  This is a confection that has been around a fe...
3           Cough Medicine  If you are looking for the secret ingredient i...

Review # 1
Not as Advertised - Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

Review # 2
"Delight" says it all - This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case, Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.

Review # 3
Cough Medicine - If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
```

By definition, a contraction is the combination of two words into a reduced form, with the omission of some internal letters and the use of an apostrophe. You can get the list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python.

3. Replace contractions with their longer forms, as shown here:

```python
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    # ... (the mapping continues for the remaining contractions)
}
```
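As a quick sanity check, you can apply the mapping word by word to a sample sentence. This snippet is illustrative only and not part of the original recipe; the expand_contractions helper is a name assumed here for demonstration:

```python
def expand_contractions(text, contractions):
    # Replace each lower-cased word found in the contractions map with its long form
    return " ".join(contractions.get(word, word) for word in text.lower().split())

print(expand_contractions("I can't believe it, don't you think?", contractions))
# i cannot believe it, do not you think?
```

Punctuation attached to a word is left untouched here; the clean_text function in the next step strips punctuation in a separate pass.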
4. Clean the text documents by replacing contractions and removing stop words:

```python
def clean_text(text, remove_stopwords=True):
    # Convert words to lower case
    text = text.lower()
    # Replace contractions with their longer forms
    if True:
        text = text.split()
        new_text = []
        for word in text:
            if word in contractions:
                new_text.append(contractions[word])
            else:
                new_text.append(word)
        text = " ".join(new_text)
    # Remove URLs, HTML remnants, and unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)
    return text
```

5. Remove unwanted characters and, optionally, stop words. Also, make sure to replace the contractions, as shown previously. You can get the list of stop words from the Natural Language Toolkit (NLTK), which also helps with splitting sentences from paragraphs, splitting up words, and recognizing parts of speech. Import the toolkit using the following commands:

```python
import nltk
nltk.download('stopwords')
```

6. Clean the summaries and texts as shown in the following snippet:

```python
# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")
```

7. Finally, save all the reviews into a pickle file. pickle serializes objects so that they can be saved to a file and loaded in a program again later on:

```python
stories = list()
for i, text in enumerate(clean_texts):
    stories.append({'story': text, 'highlights': clean_summaries[i]})

# save to file
dump(stories, open('/deeplearning-keras/ch09/summarization/review_dataset.pkl', 'wb'))
```

Encoder-decoder architecture

Develop a basic character-level seq2seq model for text summarization. A word-level model is quite common in the domain of text processing, but for this article you will use a character-level model. As mentioned earlier, the encoder-decoder architecture is a way of creating RNNs for sequence prediction. The encoder reads the entire input sequence and encodes it into an internal representation, usually a fixed-length vector named the context vector. The decoder, on the other hand, reads the encoded input sequence from the encoder and produces the output sequence. In other words, the architecture consists of two primary models: one reads the input sequence and encodes it into a fixed-length vector, and the second decodes the fixed-length vector and outputs the predicted sequence. This architecture is designed for seq2seq problems.

1. Firstly, define the hyperparameters, such as the batch size, the number of epochs for training, the dimensionality of the encoding space, and the number of samples to train on:

```python
batch_size = 64
epochs = 110
latent_dim = 256
num_samples = 10000
```

2. Next, load the review dataset from the pickle file:

```python
stories = load(open('/deeplearning-keras/ch09/summarization/review_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
print(type(stories))
```

The output will be as follows:

```
Loaded Stories
```
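One detail worth flagging: the num_samples hyperparameter defined above is never applied anywhere in this excerpt, and the vectorization output in the next step reflects the full dataset. If you do want to limit training to a subset, as the hyperparameter suggests, one possible place to do so (an assumption, not part of the original code) is immediately after loading the stories:

```python
# Optionally keep only the first num_samples reviews to shrink the training set (assumed usage)
stories = stories[:num_samples]
print('Using %d stories' % len(stories))
```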
3. Then, vectorize the data:

```python
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

for story in stories:
    input_text = story['story']
    for highlight in story['highlights']:
        target_text = highlight
        # We use "tab" as the "start sequence" character
        # for the targets, and "\n" as "end sequence" character.
        target_text = '\t' + target_text + '\n'
        input_texts.append(input_text)
        target_texts.append(target_text)
        for char in input_text:
            if char not in input_characters:
                input_characters.add(char)
        for char in target_text:
            if char not in target_characters:
                target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
```

The output will be as follows:

```
Number of samples: 568411
Number of unique input tokens: 84
Number of unique output tokens: 48
Max sequence length for inputs: 15074
Max sequence length for outputs: 5
```

4. Now, create a generic function (here called define_models) to define the encoder-decoder RNN:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

def define_models(n_input, n_output, n_units):
    # define training encoder
    encoder_inputs = Input(shape=(None, n_input))
    encoder = LSTM(n_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]
    # define training decoder
    decoder_inputs = Input(shape=(None, n_output))
    decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    # define inference encoder
    encoder_model = Model(encoder_inputs, encoder_states)
    # define inference decoder
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
    # return all models
    return model, encoder_model, decoder_model
```

Training

1. For running the training, use the rmsprop optimizer and categorical_crossentropy as the loss function:

```python
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model
model.save('/deeplearning-keras/ch09/summarization/model2.h5')
```

The output will be as follows:

```
 64/800 [=>............................] - ETA: 22:05 - loss: 2.1460
128/800 [===>..........................] - ETA: 18:51 - loss: 2.1234
192/800 [======>.......................] - ETA: 16:36 - loss: 2.0878
256/800 [========>.....................] - ETA: 14:38 - loss: 2.1215
320/800 [===========>..................] - ETA: 12:47 - loss: 1.9832
384/800 [=============>................] - ETA: 11:01 - loss: 1.8665
448/800 [===============>..............] - ETA: 9:17 - loss: 1.7547
512/800 [==================>...........] - ETA: 7:35 - loss: 1.6619
576/800 [====================>.........] - ETA: 5:53 - loss: 1.5820
...
512/800 [==================>...........] - ETA: 7:19 - loss: 0.7519
576/800 [====================>.........] - ETA: 5:42 - loss: 0.7493
640/800 [=======================>......] - ETA: 4:06 - loss: 0.7528
704/800 [=========================>....] - ETA: 2:28 - loss: 0.7553
768/800 [===========================>..] - ETA: 50s - loss: 0.7554
```
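Note that the model.fit call above consumes encoder_input_data, decoder_input_data, and decoder_target_data, but this excerpt never shows those arrays being built. The sketch below is one possible way to construct them, assuming the standard one-hot, character-level encoding used in Keras seq2seq examples; the array and lookup names are assumptions, not part of the original recipe. With the full dataset (568,411 samples and inputs up to 15,074 characters) these dense tensors would not fit in memory, so in practice you would work with a subset of the reviews or a data generator.

```python
import numpy as np

# Character-to-index lookups built from the vocabularies collected during vectorization
input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}

# One-hot tensors of shape (num_samples, max_seq_length, num_tokens)
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
```

The models themselves can then be built with the define_models function shown earlier, for example as model, encoder_model, decoder_model = define_models(num_encoder_tokens, num_decoder_tokens, latent_dim), before calling model.compile and model.fit.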
2. For inference, use the following method:

```python
from numpy import array

# generate target given source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    # encode
    state = infenc.predict(source)
    # start of sequence input
    target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
    # collect predictions
    output = list()
    for t in range(n_steps):
        # predict next char
        yhat, h, c = infdec.predict([target_seq] + state)
        # store prediction
        output.append(yhat[0, 0, :])
        # update state
        state = [h, c]
        # update target sequence
        target_seq = yhat
    return array(output)
```

The output will be as follows:

```
Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
Summary(1): great coffee

Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!
Summary(2): omg gross gross

Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon to know quaker flavor packets
Summary(3): love it
```

If you found this article interesting, you can explore Keras Deep Learning Cookbook to leverage the power of deep learning and Keras to develop smarter and more efficient data models. Keras Deep Learning Cookbook shows you how to tackle different problems encountered while training efficient deep learning models, with the help of the popular Keras library.