ML.NET Sentiment Analysis with MongoDB

Earlier this year (May 2018) Microsoft announced ML.NET, an open source and cross-platform machine learning framework built for .NET developers. It is exciting news to be able to integrate custom machine learning with .NET/C# applications. Although ML.NET is still in preview release version 0.5.0 at the time of writing, you can test drive it to explore the potential power of the framework.

There are already a number of tutorials for ML.NET available from Microsoft and third parties. However, the example data sources are mostly flat files in TSV (Tab Separated Values) format. This post is aimed at the many datasets available in JSON format, unstructured datasets from web events, or datasets that are already stored in MongoDB.

This post focuses on how to build a binary classification sentiment analysis model with ML.NET using data stored in MongoDB. It is based on Microsoft’s tutorial, "Use ML.NET in a sentiment analysis binary classification", with these notable differences:

The training dataset is in JSON format.
It reads from MongoDB as its data source instead of a file.
It uses .NET Core (Ubuntu/Linux).

The full code example and data can be found on github.com/sindbach/mlnet_mongodb. I would recommend reviewing Microsoft’s tutorial for more information.

The Data

A good machine learning journey always starts with a good dataset. The dataset used here comes from the Yelp Dataset Challenge, which ends on 31st December 2018. The data is ~2.9GB in size and, most importantly, in JSON format.

The part of the dataset that is of interest is the yelp_academic_dataset_review.json file. The sentiment analysis model will be trained on the Yelp reviews to predict whether a review has a positive or negative sentiment.

The following is an example JSON structure from the file:

{"business_id": "iCQpiavjjPzJ5_3gPD5Ebg","cool": 0,"date": "2011-02-25","funny": 0,"review_id": "x7mDIiDB3jEiPGPHOmDzyw","stars": 2,"text": "The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo / Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...","useful": 0,"user_id": "msQe1u7Z_XuqjGoqhB0J5g"}

Two fields in this structure are important: text and stars. The text field contains the user’s review comment, and the stars field indicates whether the review is positive or not.
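As a rough illustration of how little of each record is actually needed, the sketch below parses review lines and keeps only text and stars. It assumes one JSON document per line, which is the layout mongoimport expects by default. The sample lines are shortened, made-up stand-ins for real reviews, and extractFields is a hypothetical helper name.

```javascript
// Shortened, made-up sample lines in the same shape as the review file.
const sampleLines = [
  '{"review_id": "a1", "stars": 2, "text": "The pizza was okay.", "useful": 0}',
  '{"review_id": "b2", "stars": 5, "text": "Best burger in town.", "useful": 3}',
].join("\n");

// extractFields is a hypothetical helper, not part of any library.
function extractFields(ndjson) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")          // skip blank lines
    .map((line) => JSON.parse(line))                // one JSON document per line
    .map(({ text, stars }) => ({ text, stars }));   // keep only the two fields
}

const reviews = extractFields(sampleLines);
console.log(reviews);
```

Every other field (business_id, cool, funny, and so on) is dropped here because the model only needs the review text and a label derived from the star rating.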

The Database

Time to load the review data into a database. For this post, the data will be loaded into MongoDB Atlas, a cloud hosted database-as-a-service for MongoDB. You can follow MongoDB’s tutorial to create an Atlas FREE tier if you would like to test the data loading as well.

The data can be loaded to MongoDB Atlas using mongoimport. For example, the following command will import a file called yelp_academic_dataset_review.json into the review collection in the yelp database:

# Replace <password> and <cluster-host> with your own Atlas values.
mongoimport --uri "mongodb+srv://user:<password>@<cluster-host>/yelp" --collection review ./yelp_academic_dataset_review.json

Once the import has completed, use either the mongo shell or MongoDB Compass to check the data.

MongoDB Compass Document View

There’s one more preparation step to perform before jumping into the code. Since we’re building a binary classifier, we need a binary value that marks each review as positive (1) or negative (0). Fortunately, every document contains a star rating in the range of 1 to 5, where a value of 1 indicates a negative review and a value of 5 indicates a positive review.

The MongoDB Aggregation Pipeline can be used to add a new field called sentiment to the dataset, where the value is based on the stars rating. The sentiment value is determined with the following logic: any review with a stars value greater than 3 is positive (1), and any value equal to or less than 3 is negative (0).

For example, use the $addFields stage to add the new field and $out stage to store the output into a separate collection:

db.review.aggregate([
  {"$addFields": {"sentiment": {"$cond": {"if": {"$gt": ["$stars", 3]}, "then": 1, "else": 0}}}},
  {"$out": "review_train"}
]);

MongoDB Compass Aggregation Pipeline Builder
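To sanity-check the labelling rule before running it against the whole collection, the same logic can be sketched in plain JavaScript. This is only an in-memory mirror of the $cond expression, not how MongoDB evaluates it; sentimentFromStars and addSentiment are hypothetical helper names and the sample documents are made up.

```javascript
// Mirrors {$cond: {if: {$gt: ["$stars", 3]}, then: 1, else: 0}}.
function sentimentFromStars(stars) {
  return stars > 3 ? 1 : 0;
}

// Like $addFields: copies each document and adds a sentiment field.
function addSentiment(docs) {
  return docs.map((doc) => ({ ...doc, sentiment: sentimentFromStars(doc.stars) }));
}

// Made-up sample documents covering both sides of the threshold.
const labelled = addSentiment([
  { text: "Terrible.", stars: 1 },
  { text: "It was fine.", stars: 3 },
  { text: "Pretty good.", stars: 4 },
  { text: "Loved it!", stars: 5 },
]);

console.log(labelled.map((d) => `${d.stars} -> ${d.sentiment}`).join(", "));
// prints "1 -> 0, 3 -> 0, 4 -> 1, 5 -> 1"
```

Note that a 3-star review lands on the negative side, which is a design choice of this post rather than a property of the data.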

Note: You can also find a small portion of the JSON data on github.com/sindbach/mlnet_mongodb: data. The training data consists of 5000 positive reviews and 5000 negative reviews.
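To confirm that a labelled set is balanced the way the note describes, you can tally documents per sentiment value. The sketch below does this in plain JavaScript over an in-memory array; countBySentiment is a hypothetical helper and the sample documents are made up. Against the real collection, a $group stage keyed on sentiment would play the same role.

```javascript
// countBySentiment is a hypothetical helper, not part of any library.
function countBySentiment(docs) {
  const counts = { positive: 0, negative: 0 };
  for (const doc of docs) {
    if (doc.sentiment === 1) counts.positive += 1;
    else counts.negative += 1;
  }
  return counts;
}

// Made-up sample: two positive and two negative documents.
const balance = countBySentiment([
  { text: "Loved it!", sentiment: 1 },
  { text: "Terrible.", sentiment: 0 },
  { text: "Pretty good.", sentiment: 1 },
  { text: "Never again.", sentiment: 0 },
]);

console.log(balance);
```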

The Code

This post uses .NET Core, a free and open-source managed framework for Windows, macOS and Linux. The project has only two dependencies:

MongoDB .NET/C# driver version 2.7.0
ML.NET version 0.5.0

The SentimentData class is modified as follows to serialize and/or deserialize the review document structure from MongoDB:

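using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

[BsonIgnoreExtraElements]
public class SentimentData
{
    [BsonId]
    [BsonRepresentation(BsonType.ObjectId)]
    public string Id { get; set; }

    // Maps the "sentiment" field to the Label property used for training.
    [BsonElement("sentiment")]
    public float Label { get; set; }

    public string text { get; set; }
}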

BsonIgnoreExtraElements tells the driver to skip any document fields that have no matching class member, so only _id (mapped to Id), sentiment (mapped to Label), and text are deserialized. These are the fields we will use for training. Next, we instantiate a MongoClient object to connect to MongoDB using a connection string URI:

// Replace <password> and <cluster-host> with your own Atlas values.
static string mongoURI = "mongodb+srv://usr:<password>@<cluster-host>";
static readonly MongoClient client = new MongoClient(mongoURI);

Using the MongoClient object, we can access the data in the yelp database and review_train collection:

var db = client.GetDatabase("yelp");
var collection = db.GetCollection<SentimentData>("review_train");

The ML.NET LearningPipeline requires an enumerable object which we can easily get by invoking Find() on collection:

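// An empty filter document matches every document in the collection.
var documents = collection.Find<SentimentData>(new BsonDocument()).ToEnumerable();
// pipeline is the LearningPipeline instance, as in Microsoft's tutorial.
pipeline.Add(CollectionDataSource.Create(documents));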

To test the sentiment analysis model, we’ll use four reviews currently displayed on Yelp for restaurants in Sydney, Australia:

“Very bad service and low quality of coffee too. Waiting for so long even tried to rush them already.”
“This place is amazing!! I had the classic cheese burger with fries. Hands down the best burger I have ever had.”
“If I could give zero stars I would. Terribly overpriced. Dried over cooked barramundi with no seasoning or flavor at all.”
“Small menu but the food is quite good. It’s fast and easy, one of the better options around the area. We had the seafood laksa and seafood Pad Kee Mao.”

The prediction results are:

Sentiment Predictions

---------------------

Sentiment: Very bad service and low quality of coffee too. Waiting for so long even tried to rush them already. | Prediction: Negative

Sentiment: This place is amazing!! I had the classic cheese burger with fries. Hands down the best burger I have ever had | Prediction: Positive

Sentiment: If I could give zero stars I would. Terribly overpriced. Dried over cooked barramundi with no seasoning or flavor at all | Prediction: Negative

Sentiment: Small menu but the food is quite good. It's fast and easy, one of the better options around the area. We had the seafood laksa and seafood Pad Kee Mao | Prediction: Positive

Note: You can find the full code example on github.com/sindbach/mlnet_mongodb: sentiment.

Loading and reading data from MongoDB as an ML.NET data source is straightforward. The potential of utilising ML.NET to integrate machine learning with datasets stored in MongoDB is exciting, and I’m looking forward to future releases of ML.NET.