In this article, we build a machine-learning model to guess the tone of customer reviews based on historical data. It is a classification problem solved with Natural Language Processing (NLP). The purpose of NLP is to teach languages to computers by developing algorithms and models that allow them to read and understand text; it also allows them to generate text.

The article contains six parts:

1. Text Preprocessing: Reformat our input data.
2. Text Representation: Transform our data to make it readable by the machine-learning program. We use the CountVectorizer technique, an implementation of bag-of-words.
3. Training the model: Split our data into train and test datasets, then use Logistic Regression to train our model.
4. Evaluate the model: Use tools such as precision, recall, and F1-score to evaluate our model.
5. Improve model performance: Understand the evaluation to improve results.
6. Make predictions: Pass an unknown review to the program and let it guess the tone (negative, neutral, or positive).

In each part, there is a Python snippet explaining how it works. Overall, you will understand the steps to resolve a sentiment analysis classification problem with NLP.

Text Preprocessing

The first step in our journey is text preprocessing. It means transforming the original text into a form that our machine-learning algorithms can process. Usually, we perform several actions, such as:

- Tokenization: Breaking the text down into simple words (tokens). It facilitates text processing and analysis.
- Removing punctuation: Punctuation does not contribute much to the meaning of a sentence. We can remove commas, periods, quotation marks, etc.
- Removing stopwords: Stopwords are articles, pronouns, and conjunctions. We can remove them, as we did with punctuation.
- Lowercasing: Usually, we lowercase text to avoid duplicating the same tokens. For example, tokens such as "Beatles" and "beatles" will then be considered the same.
- Stemming: Transforming words to their base form. For example, "singing" and "sings" both become "sing".

There are other transformations, such as removing special characters, spelling out numbers as text, etc.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download('punkt')
nltk.download('stopwords')

def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [token.lower() for token in tokens if token not in string.punctuation]
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
```

We start by importing the necessary libraries: stopwords, word_tokenize, and string. We write our preprocessing function to perform tokenization, punctuation removal, lowercasing, and stopword removal. Finally, we join the tokens to form a sentence.

With the following text example:

```python
text = "They have built a home, sweet home with a couple of kids running in the yard"
```

It transforms the raw text into "built home sweet home couple kids running yard".
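Note that the preprocess function above does not implement the stemming step from the list. As a minimal sketch (my addition, not part of the original code), it could be added with NLTK's PorterStemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess_with_stemming(text):
    # Same pipeline as preprocess, plus stemming of the surviving tokens
    tokens = word_tokenize(text)
    tokens = [token.lower() for token in tokens if token not in string.punctuation]
    stop_words = set(stopwords.words('english'))
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)
```

With this variant, the example sentence becomes something like "built home sweet home coupl kid run yard". Stems are not always real words, which is fine: they only need to map related word forms to the same token.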
Text Representation

Computer programs understand numbers better than words, so we need to transform our data into numbers. There are several solutions; I will use CountVectorizer to turn the data into vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = [
    ("I like their pedagogy and training programs", "positive"),
    ("There are a lot of great features.", "positive"),
    ("I do not recommend this school.", "negative"),
    ("It is a classic school, nothing special.", "neutral")
]

texts, labels = zip(*data)
preprocessed_texts = [preprocess(text) for text in texts]

countVectorizer = CountVectorizer()
X = countVectorizer.fit_transform(preprocessed_texts)
```

To illustrate our use case, we store our 4 customer reviews (2 positive, 1 negative, and 1 neutral) inside a variable.

We create our CountVectorizer and then call the fit_transform method. fit_transform learns the vocabulary of the input data by analyzing the text and identifying unique words. After learning it, it transforms the input data into a numerical representation, creating a matrix that contains the number of occurrences of each word in the vocabulary.

With our example, printing the X matrix gives:

```python
print(countVectorizer.get_feature_names_out())
print(X.toarray())

# Output:
# ['and' 'are' 'classic' 'do' 'features' 'great' 'is' 'it' 'like' 'lot'
#  'not' 'nothing' 'of' 'pedagogy' 'programs' 'recommend' 'school' 'special'
#  'their' 'there' 'this' 'training']
# [[1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1]
#  [0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0]
#  [0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0]
#  [0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0]]
```

Our matrix X contains a bunch of 0 and 1. To understand it better, take the first row, [1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1], corresponding to our first sentence: "I like their pedagogy and training programs".

We have 22 words in our vocabulary. The first word is "and"; it appears one time in the first sentence, so we store 1. The second word in our vocabulary is "are"; it does not appear in the first sentence, so we store 0. We continue like this for each word of each sentence.

Training the model

Now, we are going to train and evaluate our model. We use Logistic Regression. To have accurate results, I created my dataset, customer_reviews.csv, available here. It contains 50 positive reviews, 50 negative, and 10 neutral. So, we just need to replace the data variable we previously hardcoded with the new dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

!wget -cv https://raw.githubusercontent.com/walterwhites/machine_learning/main/customer_reviews.csv

data = pd.read_csv('customer_reviews.csv')
texts = data['review'].tolist()
feeling = data['feeling'].tolist()

countVectorizer = CountVectorizer()
X = countVectorizer.fit_transform(texts)
```

Then we split our data into train and test datasets and call the fit method to train the model. With test_size=0.3, 30% of the data goes into the test dataset and the remaining 70% is used for training.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, feeling, test_size=0.3, random_state=40)

model = LogisticRegression()
model.fit(X_train, y_train)
```

Evaluate the model

After training our model, we need to evaluate it. We can use classification_report.

```python
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

It returns four metrics:

- Precision: The ratio of correct predictions among all predictions made for a class. In our use case, a precision of 42% on the negative class means that when the model predicts a negative review, it is right 42% of the time.
- Recall: The ratio of correctly predicted positive observations to the total actual positive observations. A recall of 78% on positive feelings means the model finds 78% of the positive feelings inside the dataset.
- F1-score: The harmonic mean of precision and recall. It gives good insights when there is a class imbalance.
- Support: The number of samples belonging to each class.

We see that the evaluation is not that great. Analyzing it a little, we understand it is due to our poor dataset: it does not contain a lot of reviews.
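To see exactly which classes the model confuses, a confusion matrix can complement the classification report. Here is a minimal sketch (my addition, not from the original article), reusing y_test and y_pred from above:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes;
# off-diagonal cells count misclassified reviews.
print(model.classes_)
print(confusion_matrix(y_test, y_pred, labels=model.classes_))
```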
Improve model performance

To improve the performance of our model, we need more data. I prepared a wider dataset: https://github.com/walterwhites/machine_learning/blob/main/customer_reviews_wide.csv

We adapt the code to import the new dataset, then we rerun the next steps.

```python
!wget -cv https://raw.githubusercontent.com/walterwhites/machine_learning/main/customer_reviews_wide.csv
data = pd.read_csv('customer_reviews_wide.csv')
```

Now we re-evaluate our model. The results are much better: the program succeeds in detecting the feeling of almost every customer review, with an F1-score up to 0.96 and an accuracy of 0.95.

Make Predictions

We are going to test our model with unseen reviews.

```python
def predict_feeling(text):
    preprocessed_text = preprocess(text)
    text_representation = countVectorizer.transform([preprocessed_text])
    feeling = model.predict(text_representation)[0]
    return feeling

john_feeling = "This website is normal."
paul_feeling = "I am so happy, the product I received is exceptional."
george_feeling = "I did not like the product I received, I asked for a refund"

predicted_feeling_john = predict_feeling(john_feeling)
predicted_feeling_paul = predict_feeling(paul_feeling)
predicted_feeling_george = predict_feeling(george_feeling)

print(predicted_feeling_john)
print(predicted_feeling_paul)
print(predicted_feeling_george)
```

- "This website is normal.": the program qualifies this review as neutral.
- "I am so happy, the product I received is exceptional.": the program qualifies this review as positive.
- "I did not like the product I received, I asked for a refund": the program qualifies this review as negative.

You can retrieve the full code inside my GitHub repo: https://github.com/walterwhites/machine_learning/blob/main/Analyse Customer Reviews with Natural Language Processing(NLP).ipynb

Conclusion

We saw how to resolve our classification problem using NLP, with CountVectorizer and Logistic Regression. We could go further and handle more complexity in our data using BERT (Bidirectional Encoder Representations from Transformers).

We used precision, recall, and F1-score; they provide good insights into the performance of our model on our dataset.

Even a good classification score (e.g., an F1-score equal to 1) does not mean the model is perfect: we can still struggle with the overfitting issue.

To resolve the overfitting issue, you may use BERT, improve the dataset quality, or use other algorithms (e.g., Support Vector Machine (SVM), Random Forest, etc.) according to your use case and input data, as in the sketch below.
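As a quick illustration of that last point, here is a minimal sketch (my addition, not from the original article) of swapping the classifier while keeping the same bag-of-words features and train/test split; LinearSVC and RandomForestClassifier are my choices among scikit-learn's options:

```python
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# A linear SVM, which often works well on sparse bag-of-words features
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)
print(classification_report(y_test, svm_model.predict(X_test)))

# A random forest, for comparison
forest_model = RandomForestClassifier(random_state=40)
forest_model.fit(X_train, y_train)
print(classification_report(y_test, forest_model.predict(X_test)))
```

Also published here.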