A step-by-step tutorial on analysing the sentiment of Amazon product reviews with the FastText API
This blog is a detailed step-by-step tutorial on using FastText for text classification. As a worked example, we perform sentiment analysis of customer reviews on Amazon.com. We also show how the reviews of a particular product can be scraped and classified hands-on, so that the results can be used to judge the quality of a product from its feedback before purchase.
Text classification has become an essential component of the commercial world, whether it is used in spam filtering or in analysing the sentiments of tweets or customer reviews for e-commerce websites, which are perhaps the most ubiquitous examples.
FastText is an open-source library developed by Facebook AI Research (FAIR) for efficient text classification and learning of word representations. FastText can train on millions of text examples in under ten minutes on a multi-core CPU, and can predict labels for raw unseen text among more than 300,000 categories in less than five minutes using the trained model.
A manually annotated dataset of Amazon reviews, containing a few million reviews, was obtained from Kaggle.com and used for training the model after conversion to the FastText format.
The data format for FastText is as follows:
__label__<X> __label__<Y> ... <Text>
where X and Y represent the class labels.
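To make the format concrete, here is a minimal Python sketch that splits one such line back into its labels and its text. The helper name split_labels is our own, and the sketch assumes that all labels sit at the start of the line, as in the format above.

```python
LABEL_PREFIX = "__label__"

def split_labels(line):
    """Split one fastText-formatted line into (labels, text)."""
    tokens = line.strip().split(" ")
    labels = []
    # Leading tokens that start with the prefix are labels; the rest is text.
    while tokens and tokens[0].startswith(LABEL_PREFIX):
        labels.append(tokens.pop(0)[len(LABEL_PREFIX):])
    return labels, " ".join(tokens)

labels, text = split_labels("__label__2 Great CD: My lovely Pat has one of the GREAT voices")
print(labels)  # ['2']
print(text)    # Great CD: My lovely Pat has one of the GREAT voices
```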
In the dataset we use, the review title is prepended to the review body, separated by a ‘:’ and a space.
A sample from the training data file is given below. The datasets for training and testing the models are available on Kaggle.com.
__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"
Here, we have only two classes, 1 and 2, where __label__1 signifies that the reviewer gave either 1 or 2 stars for the product, while __label__2 indicates a 4 or 5 star rating.
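If you start from raw star ratings rather than the ready-made Kaggle files, the mapping above could be sketched as follows. The function names are hypothetical, and skipping 3-star reviews is our own assumption, since the two classes described above do not cover them.

```python
def star_rating_to_label(stars):
    """Map a 1-5 star rating to the two-class label used in this dataset."""
    if stars in (1, 2):
        return "__label__1"
    if stars in (4, 5):
        return "__label__2"
    # Assumption: 3-star reviews belong to neither class and are skipped.
    return None

def make_training_line(stars, title, body):
    """Build one line as '<label> <title>: <body>', or None if unlabeled."""
    label = star_rating_to_label(stars)
    if label is None:
        return None
    return f"{label} {title}: {body}"

print(make_training_line(5, "Great CD", "Vocals are stunning."))
# __label__2 Great CD: Vocals are stunning.
```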
Pre-process and Clean Data
Execute the following command to generate a preprocessed and cleaned training data file after normalizing text case and removing unwanted characters.
cat <path to training file> | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > <path to pre-processed output file>
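For readers who prefer to stay in Python, the same two steps can be approximated as below. This is a sketch rather than an exact port: the function name preprocess is ours, the character set is . ! ? , ' / ( ) and Python's lower() also lowercases non-ASCII letters, which tr may not do depending on the locale.

```python
import re

# Punctuation that gets padded with a space on each side: . ! ? , ' / ( )
_PUNCT = re.compile(r"([.!?,'/()])")

def preprocess(line):
    """Pad the listed punctuation with spaces, then lowercase the line."""
    return _PUNCT.sub(r" \1 ", line).lower()

print(preprocess("__label__2 Great CD: I LOVE it!"))
```

The __label__ prefix passes through unchanged because it contains no uppercase letters and none of the padded characters.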
Let us start by downloading the most recent release:
$ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
$ unzip v0.1.0.zip
Move to the fastText directory and build it:
$ cd fastText-0.1.0
$ make
Running the binary without any argument will print the high level documentation, showing the different use cases supported by fastText:
>> ./fasttext
usage: fasttext <command> <args>
The commands supported by fasttext are:
  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies
In this tutorial, we mainly use the supervised, test and predict subcommands, which correspond to learning (and using) a text classifier.
Training the model
The following command is used to train a model for text classification:
./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__
The -input command line option refers to the training file, while the -output option refers to the location where the model is to be saved. After training is complete, a file with the .bin extension, containing the trained classifier, is created in the given location.
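If you would rather launch training from a script, one option is to assemble the same command in Python and hand it to subprocess. The sketch below only builds the argument list. The function names and file names are hypothetical, and any extra keyword arguments are simply passed through as "-name value" pairs.

```python
import subprocess

def build_train_command(train_file, model_path, binary="./fasttext", **options):
    """Return the argument list for `fasttext supervised`."""
    command = [binary, "supervised",
               "-input", train_file,
               "-output", model_path,
               "-label", "__label__"]
    # Extra keyword arguments become "-name value" pairs, e.g. epoch=50.
    for name, value in options.items():
        command += [f"-{name}", str(value)]
    return command

def train(train_file, model_path, **options):
    """Run the training command; needs the compiled fasttext binary."""
    subprocess.run(build_train_command(train_file, model_path, **options), check=True)

# Hypothetical file names, for illustration only.
print(" ".join(build_train_command("reviews.train.txt", "reviews_model")))
# ./fasttext supervised -input reviews.train.txt -output reviews_model -label __label__
```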
Increasing number of epochs for training
By default, the model sees each training example 5 times (5 epochs). To train for longer, we can raise this value with the -epoch argument.
./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -epoch 50
Specify learning rate
Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all and thus does not learn anything. Good values of the learning rate are in the range 0.1 - 1.0. The default value of lr is 0.1. Here’s how we specify this parameter.
./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -lr 0.5
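To see why a learning rate of 0 means no learning, consider one generic gradient-descent update on a single weight. This is a toy illustration with made-up numbers, not fastText's internal training code.

```python
def sgd_step(weight, gradient, lr):
    """One plain gradient-descent update: move against the gradient, scaled by lr."""
    return weight - lr * gradient

weight, gradient = 2.0, 4.0  # made-up values for illustration
for lr in (0.0, 0.25, 0.5, 1.0):
    print(lr, sgd_step(weight, gradient, lr))
# 0.0 2.0   (no change at all)
# 0.25 1.0
# 0.5 0.0
# 1.0 -2.0
```

The larger the learning rate, the further the weight moves after each example.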
Using n-grams as features
This is a useful step for problems that depend on word order, especially sentiment analysis. It makes the model use concatenations of consecutive tokens within an n-sized window as features for training.
We specify the -wordNgrams parameter for this (ideally a value between 2 and 5):
./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -wordNgrams 3
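The idea of word n-grams can be illustrated with a few lines of Python. The sketch below lists every window of up to max_n consecutive tokens. It only shows which token windows become features and is not how fastText stores them internally. The function name is ours.

```python
def word_ngrams(tokens, max_n):
    """Return all n-grams of 1..max_n consecutive tokens, joined by spaces."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

print(word_ngrams("this is not good".split(), 2))
# ['this', 'is', 'not', 'good', 'this is', 'is not', 'not good']
```

With single words alone, "not" and "good" are separate features. The bigram "not good" lets the classifier pick up the negation.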
The following command tests the model on a pre-annotated test dataset. It compares the original labels with the predicted labels of each review and reports evaluation scores in the form of precision and recall values.
Precision is the fraction of the labels predicted by fastText that are correct. Recall is the fraction of the true labels that were successfully predicted.
./fasttext test <path to model> <path to test file> k
where the parameter k specifies that the model predicts the top k labels for each review.
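The two scores can be computed by hand as in the following sketch, which takes the true labels and the top k predicted labels for each example. The function name and the toy data are ours.

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """
    true_labels: list of sets of gold labels, one set per example.
    predicted_labels: list of lists with the top-k predictions per example.
    """
    correct = sum(len(set(pred) & gold)
                  for gold, pred in zip(true_labels, predicted_labels))
    n_predicted = sum(len(pred) for pred in predicted_labels)
    n_gold = sum(len(gold) for gold in true_labels)
    return correct / n_predicted, correct / n_gold

gold = [{"2"}, {"1"}, {"2"}, {"2"}]
pred = [["2"], ["2"], ["2"], ["1"]]  # k = 1
print(precision_recall_at_k(gold, pred))  # (0.5, 0.5)
```

When every review has exactly one true label and k is 1, the two denominators are equal, so precision and recall come out the same.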
Evaluating our trained model on test data of 400,000 reviews gives a precision and recall of 91%, and the model trains very quickly.
Scrape Amazon Customer Reviews
We use an existing Python library to scrape reviews from Amazon review pages.
To set up the module, type the following in your command prompt/terminal:
pip install amazon-review-scraper
Here’s sample code to scrape the reviews of a particular product, given the URL of the web page:
from amazon_review_scraper import amazon_review_scraper
url = input("Enter URL: ")
start_page = input("Enter Start Page: ")
end_page = input("Enter End Page: ")
time_upper_limit = input("Enter upper limit of time range (Example: Entering the value 5 would mean the program will wait anywhere from 0 to 5 seconds before scraping a page. If you don't want the program to wait, enter 0): ")
file_name = "amazon_product_review"
scraper = amazon_review_scraper.amazon_review_scraper(url, start_page, end_page, time_upper_limit)
scraper.scrape()
scraper.write_csv(file_name)
NOTE: While entering the URL of the customer review page of a particular product, ensure that you append
if it does not exist already, for the scraper to function properly.
The above code scrapes the reviews from the given url and creates an output csv file in the following format:
From the above CSV file, we extract the Title and the Body, join them with a ‘:’ and a space as in the training file, and store them in a separate txt file for prediction of sentiments.
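This conversion step could look like the sketch below. It assumes the CSV header names the two columns exactly Title and Body. The function name and the sample row are ours.

```python
import csv
import io

def reviews_to_lines(csv_file):
    """Yield 'title: body' lines from a CSV with Title and Body columns."""
    for row in csv.DictReader(csv_file):
        title = row["Title"].strip()
        # Collapse newlines inside the body so each review stays on one line.
        body = " ".join(row["Body"].split())
        yield f"{title}: {body}"

# In-memory sample standing in for the scraper's output file.
sample = io.StringIO('Title,Body\nGreat CD,"Vocals are stunning.\nLyrics just kill."\n')
for line in reviews_to_lines(sample):
    print(line)  # Great CD: Vocals are stunning. Lyrics just kill.
```

For a real file, open it with newline="" as the csv module recommends, and write each yielded line to the txt file. If the model was trained on pre-processed text, apply the same clean-up to these lines before prediction.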
./fasttext predict <path to model> <path to test file> k > <path to prediction file>
where k signifies that the model will predict the top k labels for each review.
The labels predicted for the above reviews are as follows:
__label__2
__label__1
__label__2
__label__2
__label__2
__label__2
__label__2
__label__2
__label__1
__label__2
__label__2
These predictions are quite accurate, as verified manually. The prediction file can then be used for further detailed analysis and visualization.
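As a first step of such an analysis, we can count the predicted labels and compute the share of positive reviews. The sketch below uses the eleven predictions listed above. The function name is ours, and in practice the lines would be read from the prediction file.

```python
from collections import Counter

def summarise_predictions(lines):
    """Count predicted labels and return (counts, share of __label__2)."""
    # With k = 1 each line holds one label; otherwise keep the first token.
    labels = [line.split()[0] for line in lines if line.strip()]
    counts = Counter(labels)
    positive_share = counts["__label__2"] / len(labels) if labels else 0.0
    return counts, positive_share

predictions = (["__label__2", "__label__1"] + ["__label__2"] * 6
               + ["__label__1"] + ["__label__2"] * 2)
counts, share = summarise_predictions(predictions)
print(counts["__label__2"], counts["__label__1"])  # 9 2
print(round(share, 2))  # 0.82
```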
Thus, in this blog, we learnt how to use the FastText API for text classification, scrape Amazon customer reviews for a given product, and predict their sentiments with the trained model for analysis.
If you have any queries or suggestions, I would love to hear them. Please write to me at [email protected]