A step by step tutorial to analyse sentiment of Amazon product reviews with the FastText API

This blog provides a detailed step-by-step tutorial on using FastText for text classification. As the example task, we perform sentiment analysis of customer reviews on Amazon.com. We also show how the reviews of a particular product can be scraped and classified hands-on, so that the results can be used to judge the quality of a product from customer feedback before purchase.

What is FastText?

Text classification has become an essential component of the commercial world. Spam filtering and sentiment analysis of tweets or of customer reviews on E-Commerce websites are perhaps the most ubiquitous examples.

FastText is an open-source library developed by Facebook AI Research (FAIR), built to simplify text classification. FastText is capable of training on millions of text examples in under ten minutes on a multi-core CPU, and of performing prediction on raw unseen text among more than 300,000 categories in less than five minutes using the trained model.

Pre-Labelled Dataset for Training

A manually annotated dataset of Amazon reviews, obtained from Kaggle.com and containing a few million reviews, was used for training the model after conversion to the FastText format.

The data format for FastText is as follows:

    __label__<X> __label__<Y> ... <Text>

where X and Y represent the class labels.

In the dataset we use, the review title is prepended to the review, separated by a ':' and a space.

A sample from the training data file is given below. The datasets for training and testing the models can be found on the Kaggle.com website.

    __label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"

Here, we have only two classes, 1 and 2, where __label__1 signifies that the reviewer gave either 1 or 2 stars for the product, while __label__2 indicates a 4 or 5 star rating.

Training FastText for Text Classification

Pre-process and Clean Data

Execute the following command to generate a pre-processed and cleaned training data file. It separates punctuation marks from the surrounding words by padding them with spaces and converts all text to lower case.

    cat <path to training file> | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > <path to pre-processed output file>

Setup FastText

Let us start by downloading the most recent release (v0.1.0 at the time of writing):

    $ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
    $ unzip v0.1.0.zip

Move to the fastText directory and build it:

    $ cd fastText-0.1.0
    $ make

Running the binary without any argument will print the high level documentation, showing the different use cases supported by fastText:

    >> ./fasttext
    usage: fasttext <command> <args>

    The commands supported by fasttext are:

      supervised              train a supervised classifier
      quantize                quantize a model to reduce the memory usage
      test                    evaluate a supervised classifier
      predict                 predict most likely labels
      predict-prob            predict most likely labels with probabilities
      skipgram                train a skipgram model
      cbow                    train a cbow model
      print-word-vectors      print word vectors given a trained model
      print-sentence-vectors  print sentence vectors given a trained model
      nn                      query for nearest neighbors
      analogies               query for analogies

In this tutorial, we mainly use the supervised, test and predict subcommands, which correspond to learning (and using) a text classifier.

Training the model

The following command is used to train a model for text classification:

    ./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__

The -input command line option refers to the training file, while the -output option refers to the location where the model is to be saved. After training is complete, a file model.bin, containing the trained classifier, is created in the given location (fastText appends the .bin extension to the given output path).

Optional parameters for improving models

Increasing number of epochs for training

By default, the model sees each training example 5 times, that is, it is trained for 5 epochs. To increase this number for better training, we can specify the -epoch argument. Example:

    ./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -epoch 50

Specify learning rate

Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus does not learn anything. Good values of the learning rate are in the range 0.1 - 1.0. The default value of lr is 0.1. Here's how we specify this parameter:

    ./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -lr 0.5

Using n-grams as features

This is a useful step for problems that depend on word order, especially sentiment analysis. It tells fastText to use the concatenation of consecutive tokens within an n-sized window as additional features for training.
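To make the idea of word n-grams concrete, here is a minimal Python sketch of how consecutive tokens can be concatenated into features. It is an illustration only: the function names word_ngrams and ngram_features are hypothetical, and fastText builds and hashes its n-gram features internally rather than through code like this.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n):
    """Collect the word n-grams of a text for every size from 1 to max_n."""
    tokens = text.split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features


# Unigrams alone lose the link between "not" and "good";
# the bigrams "not a" and "a good" keep part of the word order.
print(ngram_features("not a good product", 2))
```

With single words only, "good" looks like a positive signal here. Adding bigrams or trigrams gives the classifier a chance to learn from phrases in which word order matters.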
To use word n-grams, we specify the -wordNgrams parameter (ideally a value between 2 and 5):

    ./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -wordNgrams 3

Test and Evaluate the Model

The following command tests the model on a pre-annotated test dataset. It compares the original labels with the predicted labels of each review and generates evaluation scores in the form of precision and recall values. The precision is the fraction of labels predicted by fastText that are correct. The recall is the fraction of the true labels that were successfully predicted.

    ./fasttext test <path to model> <path to test file> k

where the parameter k means that the model is to predict the top k labels for each review.

The results obtained on evaluating our trained model on test data of 400000 reviews are as follows. As observed, a precision and recall of 91% are obtained, and the model is trained very quickly.

    N       400000
    P@1     0.913
    R@1     0.913
    Number of examples: 400000

Analyse Sentiments of Real-Time Customer Reviews of Products on Amazon.com

Scrape Amazon Customer Reviews

We use an existing Python library to scrape reviews from product pages. To set up the module, type the following in your command prompt/terminal:

    pip install amazon-review-scraper

Here's sample code to scrape the reviews of a particular product, given the URL of the web page:

    from amazon_review_scraper import amazon_review_scraper

    url = input("Enter URL: ")
    start_page = input("Enter Start Page: ")
    end_page = input("Enter End Page: ")
    time_upper_limit = input("Enter upper limit of time range (Example: Entering the value 5 would mean the program will wait anywhere from 0 to 5 seconds before scraping a page. If you don't want the program to wait, enter 0): ")
    file_name = "amazon_product_review"

    scraper = amazon_review_scraper.amazon_review_scraper(url, start_page, end_page, time_upper_limit)
    scraper.scrape()
    scraper.write_csv(file_name)

NOTE: While entering the URL of the customer review page of a particular product, ensure that you append &pageNumber=1 if it does not exist already, for the scraper to function properly.

The above code scrapes the reviews from the given URL and creates an output csv file (Dataset Link). From this csv file, we extract the Title and the Body of each review, join them with a ':' and a space as in the training file, and store them in a separate txt file for prediction of sentiments.

Prediction of Sentiments of Scraped Data

    ./fasttext predict <path to model> <path to test file> k > <path to prediction file>

where k signifies that the model will predict the top k labels for each review.

The labels predicted for the above reviews are as follows:

    __label__2
    __label__1
    __label__2
    __label__2
    __label__2
    __label__2
    __label__2
    __label__2
    __label__1
    __label__2
    __label__2

These are quite accurate, as verified manually. The prediction file can then be used for further detailed analysis and visualization.

Thus, in this blog, we learnt to use the FastText API for text classification, to scrape Amazon customer reviews for a given product, and to predict their sentiments with the trained model for analysis.

If you have any queries or suggestions, I would love to hear about them. Please write to me at abhishek.narayanan@dataturks.com.
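As a follow-up to the scraping and prediction steps above, here is a minimal Python sketch of the two glue steps the text describes: turning a review's title and body into a line for the prediction file, and tallying the predicted labels. The function names are hypothetical. The sketch assumes the title and body have already been read from the csv file, and that the prediction file holds one label per line (as when k is 1). The whitespace collapsing is an extra step that the shell command in the pre-processing section does not perform.

```python
import re
from collections import Counter


def to_prediction_line(title, body):
    """Join title and body as 'title: body' and mirror the sed/tr clean-up."""
    text = f"{title}: {body}"
    # Pad the same punctuation characters as the sed command with spaces.
    text = re.sub(r"([.!?,'/()])", r" \1 ", text)
    # Lower-case like tr, then collapse runs of whitespace
    # (an extra step, not part of the shell command).
    return " ".join(text.lower().split())


def summarise_labels(lines):
    """Count predicted labels, ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


print(to_prediction_line("Great CD", "Love it!"))

# Hypothetical prediction output for three reviews.
counts = summarise_labels(["__label__2", "__label__1", "__label__2"])
share_positive = counts["__label__2"] / sum(counts.values())
print(counts, round(share_positive, 2))
```

The share of __label__2 predictions gives a rough single-number summary of how positively a product is reviewed, which is one simple way to use the prediction file before purchase.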