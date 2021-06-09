How Machines Learn Emotions: Sentiment Analysis of Amazon Product Reviews

Hey Folks! In this article, I walk you through the sentiment analysis of Amazon Electronics Product Reviews.

The dataset

Before we move forward, let’s download the dataset that we'll use in this project.

You can download the dataset from here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz. The download is about 1.2GB. The file is gzip-compressed, so you first need to decompress it; the decompressed file is around 2.5GB. A file this large may not open in Microsoft Excel.

If you still want to open it, you can use the Delimit software. Here is the download link: http://delimitware.com/download.html.
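If you prefer to decompress the file from Python rather than with a separate tool, a minimal sketch could look like the one below. The helper name decompress_gzip is my own, and the file names in the usage note assume the download sits in your working directory.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path):
    """Stream-decompress a .gz file so the whole archive never sits in memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Calling decompress_gzip("Electronics_5.json.gz", "Electronics_5.json") should then leave the JSON file next to the archive.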

Let’s analyze the dataset

The dataset contains these columns/features:

reviewerID — ID of the reviewer, e.g. A2SUAM1J3GNN3B

asin — ID of the product, e.g. 0000013714

reviewerName — name of the reviewer

vote — helpful votes of the review

style — product metadata, e.g., "Format" is "Hardcover"

reviewText — text of the review

overall — rating of the product

summary — summary of the review

unixReviewTime — time of the review (unix time)

reviewTime — time of the review (raw)

image — images that users post after they have received the product

The dataset has lots of features, but for sentiment analysis we only need the review text (reviewText) and the rating (overall).

Importing the libraries and the data

import numpy as np
import pandas as pd
import random
import os
import json
import sys
import gzip
from collections import defaultdict
import csv
import time

# NLTK libraries and packages
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn

# ML related libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, naive_bayes, svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score as AUC

After reading the dataset, we create a smaller dataset with the id, review text, and rating of each review for sentiment analysis.

# Read the JSON file into a list (one review per line)
values = []
with open("Electronics_5.json", "r") as f:
    for i in f:
        values.append(json.loads(i))
print(values[:5])

We saved our filtered dataset in the Electronic_review.csv file.
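The filtering step itself is not shown here, so the following is only a sketch of how it might be done. It assumes that "id" means the reviewerID field and that the CSV is written without a header row; the function name write_filtered_csv is hypothetical.

```python
import csv
import json


def write_filtered_csv(json_lines_path, csv_path):
    """Copy id, review text, and rating from a JSON-lines file into a header-less CSV."""
    rows_written = 0
    with open(json_lines_path, "r", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: "id" is the reviewerID; a missing field becomes an empty cell.
            writer.writerow([
                record.get("reviewerID", ""),
                record.get("reviewText", ""),
                record.get("overall", ""),
            ])
            rows_written += 1
    return rows_written
```

Streaming line by line like this avoids holding all 2.5GB of reviews in memory at once. A call such as write_filtered_csv("Electronics_5.json", "Electronic_review.csv") would produce the file used in the next step.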

Now we read our Electronic_review data into a data frame:

# Read the dataset into a data frame
colnames = ["id", "text", "overall"]
df = pd.read_csv("Electronic_review.csv", names=colnames, header=None)

Populating the data with proper values of sentiments

The division of sentiment, based on the rating (the overall column), is as follows:

0 < Rating < 3 => Negative sentiment (-1)

Rating = 3 => Neutral sentiment (0)

3 < Rating <= 5 => Positive sentiment (1)
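The rule above can be sketched as a small function. The name rating_to_sentiment is my own, and the function assumes the rating is a number between 1 and 5.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to -1 (negative), 0 (neutral), or 1 (positive)."""
    if rating < 3:
        return -1
    if rating == 3:
        return 0
    return 1


print([rating_to_sentiment(r) for r in [1.0, 2.0, 3.0, 4.0, 5.0]])
# prints [-1, -1, 0, 1, 1]
```

With pandas, something like df["overall"].apply(rating_to_sentiment) could then fill a Sentiment column.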

Let’s save this data frame as processedData.csv.

newdf.to_csv("processedData.csv", chunksize=100000)

Let’s see what our processed data looks like:

df = pd.read_csv("processedData.csv", nrows=100000)
print(df.head(5))

Preprocess the text data samples

Let’s import some important libraries:

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn
import nltk
import re
nltk.download("stopwords")
nltk.download("punkt")

Now read processedData.csv:

df = pd.read_csv("processedData.csv")
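Later steps use a column named reviewText_final, but the code that builds it is not shown in this article. A minimal sketch of one way to clean a review is below. The function name clean_review and the tiny stop-word set are illustrative assumptions; NLTK's stopwords.words("english") would give a fuller list.

```python
import re

# Assumption: a small illustrative stop-word set, not the full NLTK list.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i"}


def clean_review(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    if not isinstance(text, str):
        return ""  # missing reviews (e.g. NaN) become empty strings
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


print(clean_review("This is the BEST charger I have bought!!"))
# prints: best charger have bought
```

Applied to the review text column with pandas apply, this would give one cleaned string per review. A real pipeline would probably also stem or lemmatize the tokens.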

Stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting succeeds on some occasions but not always, which is why this approach has limitations.

Developing a stemmer is far simpler than building a lemmatizer. The latter requires deep linguistic knowledge to create the dictionaries that let the algorithm look up the proper form of a word. Once this is done, noise is reduced and the results of the information retrieval process are more accurate.
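To make the difference concrete, here is a toy comparison in plain Python. Both the suffix list and the mini-dictionary are made up for illustration; real tools such as NLTK's PorterStemmer and WordNetLemmatizer are far more thorough.

```python
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order

# Hypothetical mini-dictionary standing in for a real lexicon such as WordNet.
LEMMA_DICT = {"studies": "study", "better": "good", "was": "be", "played": "play"}


def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ["studies", "better", "was", "played"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# better -> better | good
# was -> was | be
# played -> play | play
```

The stemmer turns "studies" into "stud", which is not a word, and cannot handle "better" or "was" at all. The dictionary lookup gets them right, but only because someone built the dictionary first.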

lat_df = df[:100000]
lat_df.to_csv("CurrentUsedFile.csv")

We saved the first 100,000 rows of data as CurrentUsedFile.csv so that we can easily process the data.

Split the dataset into train and test set

# Import the new dataset
lat_df = pd.read_csv("CurrentUsedFile.csv")
print(lat_df.head(5))

# Create x and y => x: review text, y: sentiment
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    lat_df['reviewText_final'], lat_df['Sentiment'], test_size=0.2, random_state=42)
print(Train_X.shape, Train_Y.shape)
print(Test_X.shape, Test_Y.shape)

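from sklearn.preprocessing import label_binarize
# classes must list the label values actually stored in the Sentiment column
Test_Y_binarise = label_binarize(Test_Y, classes=[0, 1, 2])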

Applying TF-IDF vectorizer to the tokens formed for each of the review samples

# Vectorize the words with the TF-IDF vectorizer, which measures how important a word in a document is in comparison to the whole corpus
from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_vect = TfidfVectorizer(max_features=500000)  # tweak features based on the dataset
Tfidf_vect.fit(lat_df['reviewText_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
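To see what the vectorizer produces, here is a small sketch on three made-up reviews. Unlike the code above, it fits on the training texts only, which is a common way to keep test-set vocabulary out of the model; the toy sentences are my own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great battery life", "battery died fast", "great sound"]
test_docs = ["great battery"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training texts only.
train_matrix = vectorizer.fit_transform(train_docs)
# Reuse the learned vocabulary for unseen text.
test_matrix = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# "life" appears in one document, "great" in two, so "life" gets the larger weight.
life = train_matrix[0, vectorizer.vocabulary_["life"]]
great = train_matrix[0, vectorizer.vocabulary_["great"]]
print(life > great)
```

Each row is one review and each column one word, so the matrices can be passed straight to the classifiers in the next section.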

Applying the SVM, NB, and DT models

Before going ahead, let’s create a model evaluation function:

def modelEvaluation(predictions, y_test_set):
    # Print model evaluation for the predicted result
    print("\nAccuracy on validation set: {:.4f}".format(accuracy_score(y_test_set, predictions)))
    print("\nClassification report :\n", metrics.classification_report(y_test_set, predictions))
    print("\nConfusion Matrix :\n", metrics.confusion_matrix(y_test_set, predictions))

Naive Bayes Model:

# Classifier - Algorithm - Naive Bayes
# Fit the training dataset on the classifier
import time
second = time.time()
Naive = naive_bayes.MultinomialNB()
historyNB = Naive.fit(Train_X_Tfidf, Train_Y)
# Predict the labels on the validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
modelEvaluation(predictions_NB, Test_Y)

from sklearn.metrics import precision_recall_fscore_support
a, b, c, d = precision_recall_fscore_support(Test_Y, predictions_NB, average='macro')
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ", accuracy_score(predictions_NB, Test_Y) * 100)
print("Precision is: ", a)
print("Recall is: ", b)
print("F-1 Score is: ", c)

Support Vector Machine (SVM) Model:
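The metrics code in this section relies on predictions_SVM, which this article does not show being created. Here is a minimal sketch of that step on toy data. The linear kernel and C=1.0 are assumptions, since the actual SVM settings are not given, and the four training sentences are made up.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for Train_X, Train_Y, and Test_X.
train_texts = ["great product", "love it", "terrible quality", "waste of money"]
train_labels = [1, 1, -1, -1]
test_texts = ["love this great product", "terrible waste of money"]

vect = TfidfVectorizer()
train_tfidf = vect.fit_transform(train_texts)
test_tfidf = vect.transform(test_texts)

# Assumption: a linear-kernel SVC; the settings used in the article are unknown.
svm_clf = svm.SVC(C=1.0, kernel="linear")
svm_clf.fit(train_tfidf, train_labels)
predictions_SVM = svm_clf.predict(test_tfidf)
print(predictions_SVM)
```

On the real data you would fit on Train_X_Tfidf and Train_Y and predict on Test_X_Tfidf, just as with the Naive Bayes model.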

# predictions_SVM must hold the SVM model's predictions for Test_X_Tfidf
asvm, bsvm, csvm, dsvm = precision_recall_fscore_support(Test_Y, predictions_SVM, average='macro')
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ", accuracy_score(predictions_SVM, Test_Y) * 100)
print("Precision is: ", asvm)
print("Recall is: ", bsvm)

Decision Tree Model:

third = time.time()
decTree = DecisionTreeClassifier()
decTree.fit(Train_X_Tfidf, Train_Y)
y_decTree_predicted = decTree.predict(Test_X_Tfidf)
modelEvaluation(y_decTree_predicted, Test_Y)

Plotting all 3 ROC together
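The plotting code is not included here, so what follows is only a sketch of one common approach: binarize the labels and draw a one-vs-rest ROC curve per class for each model. The helper names one_vs_rest_roc and plot_roc and the example scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize


def one_vs_rest_roc(y_true, y_score, classes):
    """Return {class: (fpr, tpr, auc)}, treating each class as positive in turn."""
    y_bin = label_binarize(y_true, classes=classes)
    curves = {}
    for i, cls in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
        curves[cls] = (fpr, tpr, float(auc(fpr, tpr)))
    return curves


def plot_roc(curves_by_model):
    """Draw every model's per-class ROC curve on one set of axes."""
    import matplotlib.pyplot as plt  # imported here so the helper above works without it

    for model_name, curves in curves_by_model.items():
        for cls, (fpr, tpr, area) in curves.items():
            plt.plot(fpr, tpr, label=f"{model_name}, class {cls} (AUC = {area:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()


# Made-up scores for four samples, columns ordered as classes [-1, 0, 1].
y_true = np.array([-1, 0, 1, 1])
y_score = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])
curves = one_vs_rest_roc(y_true, y_score, classes=[-1, 0, 1])
print({cls: area for cls, (_, _, area) in curves.items()})
```

For the models in this article, the scores could come from predict_proba on the Naive Bayes and decision tree classifiers and from decision_function on the SVM. Passing a dictionary such as {"NB": curves_nb, "SVM": curves_svm, "DT": curves_dt} to plot_roc would then put all three on one figure.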

That’s all.

