Hey folks! In this article, I walk you through sentiment analysis of Amazon Electronics product reviews.

The dataset

Before we move forward, let's download the dataset that we'll use in this project. You can download it from here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz

The download size of the dataset is 1.2GB. The dataset is zipped, so you first need to unzip it; the unzipped file is around 2.5GB. A file this large may not open in Microsoft Excel. If you still want to open it, you can use the Delimit software. Here is the download link: http://delimitware.com/download.html

Let's analyze the dataset

The dataset contains these columns/features:

reviewerID: ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin: ID of the product, e.g. 0000013714
reviewerName: name of the reviewer
vote: helpful votes of the review
style: product metadata, e.g., "Format" is "Hardcover"
reviewText: text of the review
overall: rating of the product
summary: summary of the review
unixReviewTime: time of the review (unix time)
reviewTime: time of the review (raw)
image: images that users post after they have received the product

The dataset has lots of features, but for sentiment analysis we need only two of them: the review and the rating.

Importing the libraries and the data

    import numpy as np
    import pandas as pd
    import random
    import os
    import json
    import sys
    import gzip
    from collections import defaultdict
    import csv
    import time

    #nltk libraries and packages
    from nltk.tokenize import word_tokenize
    from nltk import pos_tag
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.preprocessing import LabelEncoder
    from nltk.corpus import wordnet as wn

    #Ml related libraries
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression as LR
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import model_selection, naive_bayes, svm
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from sklearn import metrics
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import roc_auc_score as AUC

After reading the dataset, we create a smaller dataset with the id, review, and rating of each product for sentiment analysis.

    #reading the json file in a list
    values = []
    with open("Electronics_5.json", "r") as f:
        # the file holds one JSON object per line
        for i in f:
            values.append(json.loads(i))
    print(values[:5])

We saved our filtered dataset in the file Electronic_review.csv.

Now we read our Electronic_review data into a data frame:

    #read the dataset into a df
    colnames = ["id", "text", "overall"]
    df = pd.read_csv("Electronic_review.csv", names=colnames, header=None)

Populating the data with proper values of sentiments

The division of sentiment, based on the rating value (the overall column, not the helpful-vote count), is as follows:

0 < rating < 3 => Negative sentiment (-1)
rating = 3 => Neutral sentiment (0)
3 < rating <= 5 => Positive sentiment (1)

Let's save this data frame as processedData.csv.
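The labelling step that applies this rule is not shown in the article. A minimal sketch of the rule, before the data frame is saved, might look like the following; the function name rating_to_sentiment is hypothetical and the sample ratings are made up for illustration.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to -1 (negative), 0 (neutral) or 1 (positive)."""
    if rating < 3:
        return -1
    if rating == 3:
        return 0
    return 1

# Made-up sample ratings, one per possible star value.
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
sentiments = [rating_to_sentiment(r) for r in ratings]
print(sentiments)  # prints [-1, -1, 0, 1, 1]
```

With the data frame read above, the same function could presumably be applied with df["overall"].apply(rating_to_sentiment) to fill a Sentiment column, which is the column name the train/test split uses later.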
    newdf.to_csv("processedData.csv", chunksize=100000)

Let's see what our processed data looks like:

    df = pd.read_csv("processedData.csv", nrows=100000)
    print(df.head(5))

Preprocess the text data samples

Let's import some important libraries:

    from nltk.tokenize import word_tokenize
    from nltk import pos_tag
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.preprocessing import LabelEncoder
    from nltk.corpus import wordnet as wn
    import nltk
    nltk.download("stopwords")
    import re
    nltk.download("punkt")

Now read processedData.csv:

    df = pd.read_csv("processedData.csv")

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach has some limitations.

Developing a stemmer is far simpler than building a lemmatizer. In the latter, deep linguistic knowledge is required to create dictionaries that allow the algorithm to look for the proper form of the word. Once this is done, the noise will be reduced, and the results provided in the information retrieval process will be more accurate.

    lat_df = df[:100000]
    lat_df.to_csv("CurrentUsedFile.csv")

We saved the first 100,000 rows of data as CurrentUsedFile.csv so that we can process the data more easily.
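To make the stemming description above concrete, here is a toy sketch of suffix stripping inside a small cleaning pipeline. It is not the article's preprocessing code, which is not shown and which imports NLTK's word_tokenize, stopwords and WordNetLemmatizer. The names STOPWORDS, SUFFIXES, toy_stem and preprocess, the short stopword list and the suffix list are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (assumption); NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "i", "to", "of"}
# Suffixes are tried in order, so "es" is checked before "s".
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, stem, and rejoin."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(toy_stem(t) for t in tokens if t not in STOPWORDS)

print(preprocess("This charger stopped working and the cables are failing"))
# prints: charger stopp work cabl are fail
```

The outputs "stopp" and "cabl" show the limitation mentioned above: blind cutting does not always yield a real word, which is the gap a dictionary-based lemmatizer is meant to close.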
Split the dataset into train and test set

    #importing the new dataset
    lat_df = pd.read_csv("CurrentUsedFile.csv")
    print(lat_df.head(5))

    #create x and y => x:textreview , y:sentiment
    Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
        lat_df['reviewText_final'], lat_df['Sentiment'], test_size=0.2, random_state=42)
    print(Train_X.shape, Train_Y.shape)
    print(Test_X.shape, Test_Y.shape)

    from sklearn.preprocessing import label_binarize
    # classes=[0,1,2] is kept from the original; it assumes the Sentiment
    # labels were re-encoded from -1/0/1 to 0/1/2 before this step
    Test_Y_binarise = label_binarize(Test_Y, classes=[0,1,2])

Applying TF-IDF vectorizer to the tokens formed for each of the review samples

    # Vectorize the words by using TF-IDF Vectorizer - This is done to find how important a word in document is in comparison to the df
    from sklearn.feature_extraction.text import TfidfVectorizer
    Tfidf_vect = TfidfVectorizer(max_features=500000) #tweak features based on the dataset
    # fitted on the full corpus as in the original; fitting on Train_X only
    # would keep test-set vocabulary out of the model
    Tfidf_vect.fit(lat_df['reviewText_final'])
    Train_X_Tfidf = Tfidf_vect.transform(Train_X)
    Test_X_Tfidf = Tfidf_vect.transform(Test_X)

Applying the SVM, NB, and DT models

Before going ahead, let's create a model evaluation function:

    def modelEvaluation(predictions, y_test_set):
        #Print model evaluation to predicted result
        print("\nAccuracy on validation set: {:.4f}".format(accuracy_score(y_test_set, predictions)))
        print("\nClassification report : \n", metrics.classification_report(y_test_set, predictions))
        print("\nConfusion Matrix : \n", metrics.confusion_matrix(y_test_set, predictions))

Naive Bayes Model:

    # Classifier - Algorithm - Naive Bayes
    # fit the training dataset on the classifier
    import time
    second = time.time()
    Naive = naive_bayes.MultinomialNB()
    historyNB = Naive.fit(Train_X_Tfidf, Train_Y)
    # predict the labels on validation dataset
    predictions_NB = Naive.predict(Test_X_Tfidf)
    modelEvaluation(predictions_NB, Test_Y)
    from sklearn.metrics import precision_recall_fscore_support
    a, b, c, d = precision_recall_fscore_support(Test_Y, predictions_NB, average='macro')
    # Use accuracy_score function to get the accuracy
    print("Naive Bayes Accuracy Score -> ", accuracy_score(Test_Y, predictions_NB)*100)
    print("Precision is: ", a)
    print("Recall is: ", b)
    print("F-1 Score is: ", c)

Support Vector Machine (SVM) Model:

    # predictions_SVM comes from a fitted SVM classifier; the fitting code
    # is not included in this article
    asvm, bsvm, csvm, dsvm = precision_recall_fscore_support(Test_Y, predictions_SVM, average='macro')
    # Use accuracy_score function to get the accuracy
    print("SVM Accuracy Score -> ", accuracy_score(Test_Y, predictions_SVM)*100)
    print("Precision is: ", asvm)
    print("Recall is: ", bsvm)

Decision Tree Model:

    third = time.time()
    decTree = DecisionTreeClassifier()
    decTree.fit(Train_X_Tfidf, Train_Y)
    y_decTree_predicted = decTree.predict(Test_X_Tfidf)
    modelEvaluation(y_decTree_predicted, Test_Y)

Plotting all 3 ROC together

That's all.

Also published on Medium's sameerbairwa
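For reference, the macro-averaged precision and recall printed for each model above are unweighted means of the per-class scores. Here is a minimal pure-Python sketch of that averaging. The helper name macro_precision_recall is hypothetical, the labels are made up, and this is not scikit-learn's implementation.

```python
def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall) over labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        # true positives: predicted this label and it was correct
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n

# Made-up sentiment labels: -1 negative, 0 neutral, 1 positive.
y_true = [-1, -1, 0, 1, 1, 1]
y_pred = [-1, 0, 0, 1, 1, -1]
precision, recall = macro_precision_recall(y_true, y_pred)
print(round(precision, 4), round(recall, 4))  # prints 0.6667 0.7222
```

Because every class counts equally, a small class such as neutral reviews affects the macro score as much as the much larger positive class, which is worth keeping in mind when comparing the three models.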