I often get asked how to start with Machine Learning. I consider myself a maker, and I truly believe experience is key: solving actual real-world problems is what unlocks the mysteries. Once you have your first success at building and understanding a solution that solved your problem, you can dig deeper and refine the building blocks.
The problem we faced (see Part 1) was that we have a multitude of data vendors providing us with event information. All of these vendors use different classifications and classification systems for their event data. Data quality is not great, and in some cases there are even different classifications for the same event coming from the same vendor. I believe many companies worldwide experience the same problem in one way or another. The question is: how can we maintain a high-quality data set to ensure a good experience for our app users?
Manually refining mapping tables for new categories every day is not feasible without hiring. Since we are still a small team and can't manually maintain global mapping tables, we chose to apply a combination of data science, statistics, and algorithms, commonly known as machine learning, to solve the problem for us globally, reliably, swiftly, and automatically.
In this article we will be covering the following steps that guided us through our journey.
Our data is stored in a common MySQL database. That might be boring, but MySQL is a proven technology that scales well. The tables in the DB are populated once a drop!in app user requests data for a given city. If we cannot find a dataset in the cached table, or the dataset has expired, we re-query the data vendors for an update. The data we source usually covers only the next 48 hours, to minimize the search volume and thus the request time.
Since there are many data vendors, we receive a variety of data formats from them. For this analysis we focus on the event descriptions and classifications that we receive from the data vendors, and on how we can make sense of them. We will extract the data from our MySQL server using the Python MySQL connector.
First we have to connect to the DB and setup a cursor.
import MySQLdb

db = MySQLdb.connect(host="localhost", port=3306, user="***", passwd="*****", db="****")
cursor = db.cursor(MySQLdb.cursors.DictCursor)
Defining the cursor as a DictCursor makes it return every row as a dictionary keyed by column name. This is important since we will be addressing columns by name, not by tuple position, during data preparation.
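The DictCursor is specific to the MySQLdb driver, but the effect is easy to illustrate with the standard library's sqlite3 module, whose `row_factory` plays the same role (the `events` table and its columns here are made up for the illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row  # rows become mapping-like instead of plain tuples
con.execute("CREATE TABLE events (id INTEGER, title TEXT)")
con.execute("INSERT INTO events VALUES (1, 'Jazz Night')")

row = con.execute("SELECT id, title FROM events").fetchone()
# Columns are addressable by name, as with MySQLdb's DictCursor ...
print(row["title"])  # -> Jazz Night
# ... while a default cursor would only offer positional access: row[1]
```

With a plain cursor you would have to remember that `title` is the second column of this particular query; with named access the preparation code stays readable even when the SELECT list changes.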
Further, we will be working with UTF-8 encoded strings in Japanese, Chinese, German, French, Spanish, English and several other languages. That’s why we need to ensure the connection and character set matches the data.
cursor.execute("SET NAMES utf8;")
cursor.execute("SET CHARACTER SET utf8;")
You don't want the algorithm to break on irregularly encoded information. Both of these settings seem minor and unimportant, but when you run into them they will throw violent error messages, likely stealing hours of your development time.
Next you need to define how large your sample should be. Working on a limited sample helps you test the loose ends before you run your algorithm over the full data set, which can easily take hours.
sampleSize = 10000000
try:
    cursor.execute("""SELECT DISTINCT id, title, description, category_id, category_name
                      FROM databasename.tblname
                      LIMIT %s""", (sampleSize,))
except MySQLdb.Error as e:
    print("MySQL Error: %s" % str(e))
Here again, the little comma after the sample size is easily overlooked. Why the tuple, you might ask? Because the DB API requires it. That's why. Further, you should always wrap cursor operations in a try/except block to make sure you handle MySQL errors in the right way.
Now that we can receive the data from the DB, we can take a more detailed look at the classification data we get from the vendors, and we soon notice a beautiful chaos of words and categories. In our sample we observed roughly 500 distinct event categories. Since drop!in is used by thousands of people daily, everywhere on the planet, how can we ensure that all of them get a consistent and easy-to-use event mapping?
We therefore need to map the roughly 500 event categories to a predefined number of categories. We chose 12 because, based on our research into information density, 12 is the optimal number of categories to present in a mobile app.
So for the app and the initial expert mapping we defined 12 meaningful categories plus one 'Others' category that was supposed to catch the fall-backs. In data, much as in accounting standards, 'Others' is never a good thing because it has no real meaning. Still, we could observe a large population ending up in 'Others'. Not a good thing.
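Conceptually, the expert mapping is just a lookup table from vendor category to one of our app categories, with 'Others' as the fall-back. A minimal sketch (the vendor category strings and app categories here are invented for illustration):

```python
# Invented vendor categories on the left, our app categories on the right.
vendor_to_app = {
    "Recreation/Arts and Crafts/Leather works": "Arts",
    "Live Music/Festivals/EDM": "Music",
}

def map_category(vendor_category):
    # Anything the table does not know about falls into 'Others'.
    return vendor_to_app.get(vendor_category, "Others")

print(map_category("Live Music/Festivals/EDM"))  # -> Music
print(map_category("Pop-up Flea Market"))        # -> Others
```

Every new vendor category that nobody has added to the table yet silently lands in 'Others', which is exactly the maintenance burden we want the model to remove.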
For creating the predictive model we will exclude this 'Others' segment from the analysis. In the end, we don't want the 'Others' segment to exist; the predictive model should be able to bin those events into a real segment. In addition, we exclude all events whose description is shorter than 15 characters, to make sure the algorithm has enough words to choose from.
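Both filters can be sketched as one pass over the result set (the field names match our query, the rows themselves are invented):

```python
# Keep only events that are already mapped to a real category and whose
# description gives the algorithm enough words to work with.
MIN_DESCRIPTION_LENGTH = 15

rows = [
    {"description": "Indie rock concert with three local bands", "cat": "Music"},
    {"description": "tbd", "cat": "Music"},
    {"description": "Annual charity run through the city park", "cat": "Others"},
]

filtered = [
    row for row in rows
    if row["cat"] != "Others" and len(row["description"]) >= MIN_DESCRIPTION_LENGTH
]

print(len(filtered))  # -> 1, only the first row survives both filters
```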
Since we now have a useful sample, how do we get from the text to the predictive model?
The missing link between machine learning models and our classification problem is that the data is not numerical, so statistical methods cannot work with it directly. Therefore, the next problem we have to solve is: how can we convert our text into numbers that carry meaning and are able to differentiate? The Vector Space Model (VSM) is the solution to this problem.
VSM is an algebraic model representing textual information as a vector. A simplistic explanation could be this: assume you have a given dictionary of classifications ("Arts", "Entertainment", "Sports", "Technology"); then you could encode these four categories as the vectors:
Arts = (0,0,0,1)
Entertainment = (0,0,1,0)
Sports = (0,1,0,0)
Technology = (1,0,0,0)
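In code, such a one-hot encoding is just an identity-style mapping over the dictionary of classifications (a minimal sketch; the category order matches the vectors above):

```python
categories = ["Technology", "Sports", "Entertainment", "Arts"]

def one_hot(label):
    # Put a 1 in the position of the label, 0 everywhere else.
    return [1 if c == label else 0 for c in categories]

print(one_hot("Arts"))        # -> [0, 0, 0, 1]
print(one_hot("Technology"))  # -> [1, 0, 0, 0]
```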
However, this assignment of vectors does not help much with the classification problem, since none of these vectors represents the importance of a term. In addition, working off the already existing classifications will not help either, since we cannot extract sufficient differentiation from these short words. This is where the event descriptions come in. Every single event in the data set has a descriptive element of up to 2048 characters in length. From this document corpus we get sentences, and these sentences can be converted into the vectors we need.
One random example we get from the sample could look like this:
With her US live show debut at Coachella, her debut album Run, debuting at #1 on the US Electronic Billboard charts, a US tour taking in stops in NYC, Miami and Chicago with EDC Las Vegas, Lollapalooza and a HUGE sold out Australian Warehouse tour to come, 2015 is a year to remember for Alison Wonderland. The way you hear her tell it, Alison Wonderland has wanted to name her first album Run since before she knew how to play an instrument.
We are using the already introduced Term Frequency - Inverse Document Frequency (TF-IDF) method.
As already outlined in Part 1, TF-IDF is a well known method to evaluate the relative importance of a word in a document corpus.
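Before reaching for a library, it helps to see the computation once by hand. Here is a tiny hand-rolled TF-IDF over a three-document toy corpus, using the textbook tf x log(N/df) formulation (sklearn's vectorizer uses a smoothed variant, so absolute numbers differ, but the intuition is the same):

```python
import math

docs = [
    "concert music live music",
    "sports match live",
    "arts exhibition",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf_idf(term, doc_tokens):
    # Term frequency: share of the document's tokens that are this term.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Document frequency: in how many documents the term appears at all.
    df = sum(1 for d in tokenized if term in d)
    return tf * math.log(N / df)

# 'music' appears twice in doc 0 and in no other document -> high weight.
print(tf_idf("music", tokenized[0]))
# 'live' appears in two of the three documents -> lower weight.
print(tf_idf("live", tokenized[0]))
```

The rarer a term is across the corpus and the more often it appears within a document, the higher its score, which is exactly the "relative importance" the method is after.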
In Python, a great TF-IDF vectorizer comes as part of the sklearn package. So, for our Python implementation we first import the relevant packages, including the TF-IDF vectorizer. In addition to the sklearn package we import pickle (cPickle in Python 2), which will help us store the completed model back in the DB (as a LongBlob), and model_selection, which splits our original dataset into a training and a testing dataset. NumPy is a fundamental package for scientific computing in Python.
import pickle  # cPickle in Python 2

import numpy as np
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
Before we can start working with the TF-IDF vectorizer we need to prepare the data so the algorithm can function correctly. TF-IDF expects a list as input. Since we are in the domain of supervised learning, we need to provide the algorithm with X (the dataset with the features) and Y (the dataset with the labels). Each list has to have exactly the length of the number of observations in the sample from the database.
So the first thing after pushing the entire sample into a dictionary result set is to initialize the X and Y list variables. In addition, we keep track of our position in the result set with the counter variable 'i'.
X = [None] * cursor.rowcount
Y = [None] * cursor.rowcount
i = 0
Once this is done we simply loop over the result set and pass the data from the variables into the lists. A few things to note here: it took us some tries to get the right mix of descriptive data together. In the end we kept 'description', as it contains the relevant terms such as 'Coachella' or 'EDM'. Secondly, we pass the vendor's classification at its highest granularity level.
result = cursor.fetchall()  # the dictionary result set from above

for row in result:
    X[i] = row['description'] + " " + row['category_name'].replace("/", " ")
    Y[i] = row['cat']
    i += 1
Take 'Recreation/Arts and Crafts/Leather works' from the above example: replacing the slashes with spaces turns this single string into the separate terms 'Recreation', 'Arts and Crafts', and 'Leather works', which the vectorizer can then tokenize as ordinary words. The column 'cat' holds our previously confirmed, correctly working mapping.
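A minimal illustration of that step, using the category string from the text (the description is invented):

```python
category_name = "Recreation/Arts and Crafts/Leather works"
description = "Hands-on workshop for beginners"

# Replacing the slashes with spaces lets the vectorizer tokenize the
# hierarchy levels as ordinary words instead of one long token.
feature_text = description + " " + category_name.replace("/", " ")
print(feature_text)
# -> Hands-on workshop for beginners Recreation Arts and Crafts Leather works
```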
In the next step we randomly split the feature set into a training and a verification (testing) sample.
test_size = 0.5
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
test_size should be a number between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. The seed variable controls the random sampling from the dataset. For the first shot we set test_size to 0.5, so we split the dataset evenly. The train_test_split function returns four datasets: X_train (the training features), X_test (the testing features), Y_train (the training labels), and Y_test (the testing labels).
Now we have prepared the datasets to train our first model. First we create a TF-IDF vectorizer object. The task of this object is to convert our list of event descriptions into a matrix of TF-IDF features. We are not passing an encoding, since the default is 'utf-8' and we fortunately prepared our data above to be in that encoding. The strip_accents option removes all accents from the event descriptions, so 'Montréal' becomes 'Montreal'.
vectorizer = TfidfVectorizer(strip_accents='unicode')
X_vectors = vectorizer.fit_transform(X_train)
The second line uses the defined TF-IDF vectorizer and fit_transforms the training feature dataset, i.e. the algorithm learns the vocabulary and the IDF weights and then returns a term-document matrix. The term-document matrix describes which words exist in which document of the sample, with the calculated TF-IDF score as values.
Now one major portion of our project is done. We have successfully created numerical expressions from our text data. As the next step we have to use these numerical expressions of our texts to build our forecasting model. In this example we will use the Multinomial Naive Bayes (MNB) classifier.
MNB is a simple probabilistic classifier that applies Bayes' theorem: given the words in a description, it computes the probability of each classification and picks the most likely one. The 'naive' part is the strong independence assumption between the features, which is what keeps the method simple and fast.
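As a toy illustration of the Bayes rule behind the classifier, consider a single word ('festival') and two classes with made-up statistics. The classifier picks the class with the higher posterior, and since the evidence P(word) is the same for every class it can be dropped from the comparison:

```python
# Hypothetical training statistics: how often the word 'festival' occurs
# in descriptions of each class, and the class priors.
p_word_given_class = {"Music": 0.30, "Sports": 0.02}
p_class = {"Music": 0.40, "Sports": 0.60}

# Unnormalized posteriors via Bayes' theorem:
# P(class | word) is proportional to P(word | class) * P(class).
posterior = {c: p_word_given_class[c] * p_class[c] for c in p_class}
prediction = max(posterior, key=posterior.get)
print(prediction)  # -> Music
```

The real MNB does the same computation over all words of a description at once, multiplying the per-word likelihoods, which is exactly where the independence assumption comes in.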
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB().fit(X_vectors, Y_train)
train_score = classifier.score(X_vectors, Y_train)
print("Scored an accuracy of %s" % (train_score))
> Scored an accuracy of 0.93206467363
The fit call uses the previously created vectors plus the related labels to train the classifier. Once this is done we can score the result. The score function returns the mean accuracy on whatever data and labels you pass it; note that here we score on the training data, so this is training accuracy. For an honest estimate you would transform X_test with the same vectorizer and score against Y_test.
In our example the accuracy reaches 93.2% which is better than what we had hoped for.
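Mean accuracy is simply the fraction of predictions that match the true labels, which is easy to verify by hand (the labels here are invented for illustration):

```python
predicted = ["Music", "Sports", "Music", "Arts"]
actual    = ["Music", "Sports", "Arts",  "Arts"]

# Count the matches and divide by the number of observations.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(accuracy)  # -> 0.75
```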
pickledModel = pickle.dumps(classifier)  # byte string, e.g. for a LongBlob column

filename = 'finalized_model.sav'
pickle.dump(classifier, open(filename, 'wb'))
Since we are satisfied with the results we can store the model to save it for later.
Thank you for reading all the way down to here. In part 3 (coming soon) we will show how to bring this into real world production.
This article was brought to you by tenqyu, a startup making urban living more fun, healthy, inclusive, and thriving using big data, machine learning, and LOTs of creativity.