“You don’t need a silver fork to eat good food.”

🎬 Introduction

In this article, I explore the Wine Reviews dataset, which contains 130k wine reviews. At the end of the article, I build a simple text summarizer that condenses a given review. The summaries can also serve as review titles. I use spaCy as the natural language processing library for this project.

📋 Objective of This Project

The objective of this project is to build a model that can create relevant summaries for the reviews in the Wine Reviews dataset. The dataset contains more than 130k reviews and is hosted on Kaggle.

What Is Text Summarization?

Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

Why do we need text summarization?

The amount of textual information grows day by day. As it grows, reading all of it becomes harder, and readers lose interest. Text summarization addresses this problem by condensing a text to its most important points.

Types of Text Summarization Methods

Text summarization methods can be classified into different types.

i. Based on input type:

Single Document, where the input length is short. Many of the early summarization systems dealt with single document summarization.
Multi Document, where the input can be arbitrarily long.

ii. Based on the purpose:

Generic, where the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. The majority of the work that has been done revolves around generic summarization.
Domain-specific, where the model uses domain-specific knowledge to form a more accurate summary. For example, summarizing research papers of a specific domain, biomedical documents, etc.
Query-based, where the summary only contains information which answers natural language questions about the input text.

iii. Based on output type:

Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.
Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is more appealing, but much more difficult than extractive summarization.

Prerequisites

This article makes the following assumptions:

You are familiar with Python
You have Python 3.6 or greater installed on your system
The spaCy package is available (installation is covered below)

What is spaCy?

spaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal at explosion.ai. It is designed with the applied data scientist in mind, meaning it does not weigh the user down with decisions over what esoteric algorithms to use for common tasks, and it’s fast. Incredibly fast (it’s implemented in Cython). If you are familiar with the Python data science stack, spaCy is your numpy for NLP: it’s reasonably low-level, but very intuitive and performant. However, since spaCy is a relatively new NLP library, it is not as widely adopted as NLTK.

Installation of spaCy

spaCy, its data, and its models can be installed with pip from the Python Package Index. Use the following command to install spaCy on your machine:

    !pip install spacy

For Python 3, replace “pip” with “pip3” in the above command.

Alternatively, download the source and run the following command after unzipping:

    !python setup.py install

To download the English model loaded below, run the following command after the installation (the older `spacy.en.download all` form no longer exists in current spaCy releases):

    !python -m spacy download en_core_web_sm

You are now all set to explore and use spaCy.

Loading spaCy Libraries

    import spacy

Implementation Section

1. Import Packages

    import numpy as np  # linear algebra
    import spacy
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    import seaborn as sns
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    import string
    import re
    from collections import Counter
    from time import time
    # from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS as stopwords
    from nltk.corpus import stopwords
    import nltk
    import plotly.offline as py
    import plotly.graph_objs as go
    import plotly.tools as tls
    %matplotlib inline

    nlp = spacy.load('en_core_web_sm')
    stopwords = stopwords.words('english')
    sns.set_context('notebook')

2. Import Dataset

In this section, I load the dataset for this notebook. The dataset has a huge number of reviews, and the full dataset is hard to work with, so I read only the first 5,000 rows (`nrows=5000`) to keep things manageable.

    reviews = pd.read_csv("../input/winemag-data-130k-v2.csv", nrows=5000,
                          usecols=['points', 'title', 'description'],
                          encoding='latin1')
    reviews = reviews.dropna()
    reviews.head(15)

3. Text preprocessing

In this step, I use spaCy to preprocess the text. In other words, I remove features that are not useful, such as punctuation and stopwords, from the review descriptions. Two useful Python libraries are available for this task: NLTK and spaCy. In this notebook, I work with spaCy because it is very fast and has many useful features compared to NLTK. So without further ado, let’s get started!

    !python -m spacy download en_core_web_lg
    nlp = spacy.load('en_core_web_lg')

    def normalize_text(text):
        # Drop <pre> and <code> blocks first, then any remaining HTML tags
        tm1 = re.sub('<pre>.*?</pre>', '', text, flags=re.DOTALL)
        tm2 = re.sub('<code>.*?</code>', '', tm1, flags=re.DOTALL)
        tm3 = re.sub('<[^>]+>', '', tm2, flags=re.DOTALL)
        return tm3.replace("\n", "")

    # in this step we are going to remove code syntax from text
    reviews['description_Cleaned_1'] = reviews['description'].apply(normalize_text)

    print('Before normalizing text-----\n')
    print(reviews['description'][2])
    print('\nAfter normalizing text-----\n')
    print(reviews['description_Cleaned_1'][2])

We can see a huge difference after normalizing our text. The text is now more manageable, which will help us explore the reviews and later build the summarizer.
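To make the tag-stripping idea behind `normalize_text` concrete, here is a minimal, self-contained sketch that runs on a made-up string. The sample text and the name `strip_markup` are illustrative assumptions; they do not come from the dataset or the notebook.

```python
import re

def strip_markup(text):
    """Remove <pre>/<code> blocks, then any remaining tags, then newlines."""
    text = re.sub(r'<pre>.*?</pre>', '', text, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', '', text, flags=re.DOTALL)
    text = re.sub(r'<[^>]+>', '', text)
    return text.replace('\n', '')

# Hypothetical input, invented for this example
sample = "<p>Ripe and <b>fruity</b>.</p>\n<code>x = 1</code><p>Drink now.</p>"
print(strip_markup(sample))  # Ripe and fruity.Drink now.
```

Note that replacing newlines with an empty string glues the two sentences together. Replacing them with a space instead would be a reasonable variation if sentence boundaries matter later.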
We can also see that some punctuation and stopwords remain. We don’t need them for exploration, but I did not remove them in the first pass because the summarizer needs them later. So let’s make another column that stores the normalized text without punctuation and stopwords.

3.1 Clean text before feeding it to spaCy

    punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^`{|}~©'

    # Define function to clean up text by removing personal pronouns, stopwords, and punctuation
    def cleanup_text(docs, logging=False):
        doc = nlp(docs, disable=['parser', 'ner'])
        # tok.lemma_ is the lemma string (tok.lemma is only its integer ID)
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
        # return a plain string so each cell holds the cleaned text
        return ' '.join(tokens)

    reviews['Description_Cleaned'] = reviews['description_Cleaned_1'].apply(cleanup_text)

    print('Reviews description with punctuation and stopwords---\n')
    print(reviews['description_Cleaned_1'][0])
    print('\nReviews description after removing punctuation and stopwords---\n')
    print(reviews['Description_Cleaned'][0])

Wow! See! Now our text looks much more readable and less messy!

4. Distribution of Points

In this section, I try to understand the distribution of points. Here, points are the score WineEnthusiast gave the wine on a scale of 1 to 100; the dataset only includes wines that scored 80 or above.

    plt.subplot(1, 2, 1)
    reviews['points'].plot.hist(bins=30, figsize=(30, 5), edgecolor='white', range=[0, 150])
    plt.xlabel('Number of points', fontsize=17)
    plt.ylabel('frequency', fontsize=17)
    plt.tick_params(labelsize=15)
    plt.title('Number of points description', fontsize=17)
    plt.show()

Most descriptions have between 80 and 100 points.

5. Analyze reviews description

In this section, I analyze the wine descriptions. In Wine Reviews, the wine description plays a vital role. A good description can make a wine stand out, and it may go along with a higher score. Let’s see what we can find in the wine descriptions.

    reviews['Title_len'] = reviews['Description_Cleaned'].str.split().str.len()
    rev = reviews.groupby('Title_len')['points'].mean().reset_index()
    trace1 = go.Scatter(
        x=rev['Title_len'],
        y=rev['points'],
        mode='lines+markers',
        name='lines+markers'
    )
    layout = dict(
        title='Average points by wine description Length',
        yaxis=dict(title='Average points'),
        xaxis=dict(title='wine description Length')
    )
    fig = dict(data=[trace1], layout=layout)
    py.iplot(fig)

6. Description Summarizer

In this step, I build a description summarizer. There is a huge amount of ongoing research on text summarization, but I use a simple technique, described below.

6.1 Convert Paragraphs to Sentences

We first need to convert the whole paragraph into sentences. The most common way of converting paragraphs to sentences is to split the paragraph whenever a period is encountered.

6.2 Text Preprocessing

After converting the paragraph to sentences, we need to remove all the special characters, stop words and numbers from all the sentences.
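Steps 6.1 and 6.2 can be sketched with the standard library alone. This is only an illustration under stated assumptions: the stop-word set is a tiny hand-made list (the article uses NLTK's English list), the review text is invented, and `split_sentences` and `clean_sentence` are hypothetical helper names.

```python
import re

# Tiny illustrative stop-word list (assumption); NLTK's list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "with", "this", "in"}

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [p for p in parts if p]

def clean_sentence(sentence):
    """Lowercase, drop digits and special characters, then drop stop words."""
    letters_only = re.sub(r'[^a-z\s]', ' ', sentence.lower())
    return ' '.join(w for w in letters_only.split() if w not in STOP_WORDS)

# Hypothetical review text, invented for this example
review = "This wine is ripe and fruity. It shows 14% alcohol! Drink it in 2020."
sentences = split_sentences(review)
cleaned = [clean_sentence(s) for s in sentences]
print(sentences)
print(cleaned)  # ['wine ripe fruity', 'shows alcohol', 'drink']
```

Requiring whitespace after the period keeps decimals such as 13.5 intact, but abbreviations like "St. Emilion" would still be split. That is one reason to prefer spaCy's `doc.sents`, which the summarizer function uses, over a plain period split.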
6.3 Tokenizing the Sentences

We need to tokenize all the sentences to get all the words that exist in the sentences.

6.4 Find Weighted Frequency of Occurrence

Next we need to find the weighted frequency of occurrence of all the words. We can find the weighted frequency of each word by dividing its frequency by the frequency of the most frequent word.

6.5 Replace Words by Weighted Frequency in Original Sentences

The next step is to plug the weighted frequency in place of the corresponding words in the original sentences and find their sum. Note that the weighted frequency of the words removed during preprocessing (stop words, punctuation, digits, etc.) is zero, so they do not need to be added.

6.6 Sort Sentences in Descending Order of Sum

The final step is to sort the sentences in descending order of their sum. The sentences with the highest sums summarize the text.

Function for text summarization:

This function produces a summary from a longer text, so we call it whenever we want to summarize a description. The function is below:

    import heapq

    def generate_summary(text_without_removing_dot, cleaned_text):
        sample_text = text_without_removing_dot
        doc = nlp(sample_text)
        sentence_list = []
        for idx, sentence in enumerate(doc.sents):  # we are using spacy for sentence tokenization
            sentence_list.append(re.sub(r'[^\w\s]', '', str(sentence)))

        stopwords = nltk.corpus.stopwords.words('english')

        # Count how often each non-stopword occurs in the cleaned text
        word_frequencies = {}
        for word in nltk.word_tokenize(cleaned_text):
            if word not in stopwords:
                if word not in word_frequencies.keys():
                    word_frequencies[word] = 1
                else:
                    word_frequencies[word] += 1

        maximum_frequncy = max(word_frequencies.values())

        # Weighted frequency: divide each count by the count of the most frequent word
        for word in word_frequencies.keys():
            word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

        # Score every sentence shorter than 30 words by summing its word weights
        sentence_scores = {}
        for sent in sentence_list:
            for word in nltk.word_tokenize(sent.lower()):
                if word in word_frequencies.keys():
                    if len(sent.split(' ')) < 30:
                        if sent not in sentence_scores.keys():
                            sentence_scores[sent] = word_frequencies[word]
                        else:
                            sentence_scores[sent] += word_frequencies[word]

        # Keep the 7 highest-scoring sentences
        summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

        summary = ' '.join(summary_sentences)
        print("Original Text:\n")
        print(text_without_removing_dot)
        print('\n\nSummarized text:\n')
        print(summary)

Now that we have written the function, let’s try to summarize some descriptions.

    generate_summary(reviews['description_Cleaned_1'][8], reviews['Description_Cleaned'][8])

    generate_summary(reviews['description_Cleaned_1'][100], reviews['Description_Cleaned'][100])

    generate_summary(reviews['description_Cleaned_1'][500], reviews['Description_Cleaned'][500])

That’s awesome! We successfully made a simple winemag description summarizer.

7. Conclusion

Thanks for reading this article. If you have any suggestions, feel free to reach me in the comments, send mail, or connect on LinkedIn. Stay in touch for more updates. Thank you. 😎

For the full code, visit Kaggle.

“Let us celebrate the occasion with wine and sweet words.”

If you like this article then give it a 👏 clap. Happy Analysis!
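For readers who want to trace the scoring from sections 6.4 to 6.6 by hand, the following is a minimal, dependency-free sketch. It is not the notebook's function: it swaps spaCy and NLTK for plain `str.split`, and the sample sentences, the toy stop-word set, and the name `score_sentences` are illustrative assumptions.

```python
import heapq
from collections import Counter

STOP_WORDS = {"the", "is", "and", "a", "with", "it"}  # toy list (assumption)

def score_sentences(sentences):
    """Return {sentence: sum of the weighted frequencies of its words}."""
    tokenized = [[w for w in s.lower().split() if w not in STOP_WORDS]
                 for s in sentences]
    counts = Counter(w for words in tokenized for w in words)
    top = max(counts.values())
    # 6.4: weighted frequency = count / count of the most frequent word
    weights = {w: c / top for w, c in counts.items()}
    # 6.5: replace each word by its weight and sum per sentence
    return {s: sum(weights[w] for w in words)
            for s, words in zip(sentences, tokenized)}

# Hypothetical, punctuation-free sentences invented for this example
sentences = [
    "the wine is fruity and fresh",
    "fruity wine",
    "it is a dry finish",
]
scores = score_sentences(sentences)
# 6.6: keep the highest-scoring sentences
summary = heapq.nlargest(2, scores, key=scores.get)
print(scores)
print(summary)
```

Here "wine" and "fruity" each occur twice, so they get weight 1.0, and every other remaining word gets 0.5. The three sentences therefore score 2.5, 2.0 and 1.0, and the first two form the summary.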

📚 Summarization With Wine Reviews Using spaCy📋
