Text Classification Simplified with Facebook’s FastText

Written by dataturks | Published 2020/01/22
Tech Story Tags: machine-learning | sentiment-analysis | amazon-reviews | programming | natural-language-processing | datasets | good-company | hackernoon-top-story

TLDR A step by step tutorial to analyse sentiment of Amazon product reviews with the FastText API. This blog provides a detailed step-by-step tutorial to use FastText for the purpose of text classification. FastText is an open-source library developed by the Facebook AI Research (FAIR) It is capable of training with millions of example text data in hardly ten minutes over a multi-core CPU and perform prediction on raw unseen text among more than 300,000 categories in less than five minutes using the trained model.via the TL;DR App

A step by step tutorial to analyse sentiment of Amazon product reviews with the FastText API
This blog provides a detailed step-by-step tutorial to use FastText for the purpose of text classification. For this purpose, we choose to perform sentiment analysis of customer reviews on Amazon.com and also elaborate on how the reviews of a particular product can be scraped for performing sentiment analysis on them hands on, the results of which may be analysed to decide the quality of a product based on the given feedback, before purchase.

What is FastText?

Text classification has become an essential component of the commercial world; whether it is used in spam filtering or in analysing sentiments of tweet sor customer reviews for E-Commerce websites, which are perhaps the most ubiquitous examples.
FastText is an open-source library developed by the Facebook AI Research (FAIR), exclusively dedicated to the purpose of simplifying text classification. FastText is capable of training with millions of example text data in hardly ten minutes over a multi-core CPU and perform prediction on raw unseen text among more than 300,000 categories in less than five minutes using the trained model.

Pre-Labelled Dataset for Training

A manually annotated dataset of amazon reviews obtained from Kaggle.com containing few million reviews was collected and used for training the model after conversion to FastText format.
The data format for FastText is as follows:
__label__<X> __label__<Y> ... <Text>
where X and Y represent the class labels.
In the dataset we use, we have the review title prepended to the review, separated by a ‘
:
’ and a space.
A sample from the training data file is given below, the datasets for training and testing the models can be found here in the Kaggle.com website.
__label__2 Great CD: My lovely Pat has one
of the GREAT voices of her generation. 
I have listened to this CD for YEARS and I
still LOVE IT. When I'm in a good mood it
makes me feel better. A bad mood just
evaporates like sugar in the rain. This CD
just oozes LIFE. Vocals are jusat STUUNNING
and lyrics just kill. One of life's hidden
gems. This is a desert isle CD in my book.
Why she never made it big is just beyond me.
everytime I play this, no matter black,
white, young, old, male, female EVERYBODY
says one thing "Who was that singing ?"
Here, we have only two classes 1 and 2, where
__label__1
signifies that the reviewer gave either 1 or 2 stars for the product, while
__label__2
indicates a 4 or 5 star rating.

Training FastText for Text Classification

Pre-process and Clean Data
Execute the following command to generate a preprocessed and cleaned training data file after normalizing text case and removing unwanted characters.
cat <path to training file> | sed -e 
“s/\([.\!?,’/()]\)/ \1 /g” | tr “[:upper:]”
“[:lower:]” > 
<path to pre-processed output file>
Setup FastText
Let us start by downloading the most recent release:
$ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
$ unzip v0.1.0.zip
Move to the fastText directory and build it:
$ cd fastText-0.1.0
$ make
Running the binary without any argument will print the high level documentation, showing the different use cases supported by fastText:
>> ./fasttext
usage: fasttext <command> <args>
The commands supported by fasttext are:
  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies
In this tutorial, we mainly use the
supervised
,
test
and
predict
subcommands, which corresponds to learning (and using) text classifier.
Training the model
The following command is used to train a model for text classification:
./fasttext supervised -input 
<path to pre-processed training file> -output
<path to save model> -label __label__ 
The -input command line option refers to the training file, while the -output option refers to the location where the model is to be saved. After training is complete, a file
model.bin
, containing the trained classifier, is created in the given location.

Optional parameters for improving models

Increasing number of epochs for training
By default, the model is trained on each example for 5 epochs, to increase this parameter for better training, we can specify the -epoch argument.
Example:
./fasttext supervised -input 
<path to pre-processed training file> -output 
<path to save model> -label __label__ -epoch 50
Specify learning rate
Changing learning rate implies changing the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would means that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range
0.1 - 1.0
.
The default value of lr is 0.1. Here’s how we specify this parameter.
./fasttext supervised -input 
<path to pre-processed training file> -output
 <path to save model> -label __label__ -lr 0.5
Using n-grams as features
This is a useful step for problems depending on word order, especially sentiment analysis. It is to specify the usage of the concatenation of consecutive tokens in a n-sized window as features for training.
We specify
-wordNgrams
parameter for this (ideally value between 2 to 5):
./fasttext supervised -input 
<path to pre-processed training file> -output 
<path to save model> -label __label__ -wordNgrams 3

Test and Evaluate the Model

The following command is to test the model on a pre-annotated test dataset and compare the original labels with the predicted labels of each review and generate evaluation scores in the form of precision and recall values.
The precision is the number of correct labels among the labels predicted by fastText. The recall is the number of labels that successfully were predicted.
./fasttext test <path to model> <path to test file> k
where the parameter k represents that the model is to predict the top k labels for each review.
The results obtained on evaluating our trained model on a test data of 400000 reviews are as follows . As observed, a precision, recall of 91% is obtained and the model is trained in a very quick time.
N 400000
P@1 0.913
R@1 0.913
Number of examples: 400000

Analyse Sentiments of Real-Time Customer Reviews of Products on Amazon.com

Scrape Amazon Customer Reviews
We use an existing python library to scrape reviews from pages.
To setup the module, In your command prompt/terminal type:
pip install amazon-review-scraper
Here’s a sample code to scrape review of a particular product, given the url of the web page:
from amazon_review_scraper import amazon_review_scraper

url = input("Enter URL: ")
start_page = input("Enter Start Page: ")
end_page = input("Enter End Page: ")
time_upper_limit = input("Enter upper limit of time range (Example: Entering the value 5 would mean the program will wait anywhere from 0 to 5 seconds before scraping a page. If you don't want the program to wait, enter 0): ")
file_name = "amazon_product_review"

scraper = amazon_review_scraper.amazon_review_scraper(url, start_page, end_page, time_upper_limit)
scraper.scrape()
scraper.write_csv(file_name)
NOTE: While entering the URL of the customer review page of a particular product, ensure that you append
&pageNumber=1
if it does not exist already, for the scraper to function properly.
The above code scrapes the reviews from the given url and creates an output csv file in the following format:
From the above csv file, we extract the Title and the Body and append them together separated by a
‘:
and a space as in the training file, and store them in a separate txt file for prediction of sentiments.

Prediction of Sentiments of Scraped Data

./fasttext predict <path to model> 
<path to test file> 
k > <path to prediction file>
where k signifies that the model will predict the top k labels for each review.
The labels predicted for the above reviews are as follows:
__label__2
__label__1
__label__2
__label__2
__label__2
__label__2
__label__2
__label__2
__label__1
__label__2
__label__2
Which are quite accurate as verified manually. The prediction file can then be used for further detailed analysis and visualization purposes.
Thus, in this blog, we learnt using the FastText API for text classification, scraping Amazon Customer Reviews for a Given Product and predicting their sentiments with the trained model for analysis.
If you have any queries or suggestions, I would love to hear about it. Please write to me at [email protected].

Published by HackerNoon on 2020/01/22