Since the dawn of machine learning, imbalanced datasets have plagued anyone working on classification problems.
In truth, handling an imbalanced dataset isn’t that complicated if you know what you’re doing. In case you don’t, I’ll introduce you to the concept of classification problems, talk about real-life examples, and share efficient ways to handle imbalanced data.
For those who are just now learning about… machine learning… I’ll include this short introduction to classification problems. It will help you follow the rest of the article.
Classification problems are supervised learning problems that classify data into categories (class labels). If only two such categories exist, it’s called binary classification. With more than two class labels, we use the term multinomial classification.
The most common example of binary classification is determining whether an email is spam. A classifier labels emails based on the sample data it was given. Usually, it will deduce that emails containing certain phrases, such as the infamous “Nigerian Prince”, are more likely to be spam than not.
A multinomial classification algorithm would do the same, but would have to be more complex and capable of labeling emails as promotional material, news, or any other category you come up with.
Now that we know what classification problems are, let’s focus on imbalanced datasets.
When a single class label has a significantly lower observation rate than other class labels in the same dataset, we label that dataset as imbalanced.
Using our previous email example: if the classifier encountered only a couple of spam emails upon examining thousands, then you have an imbalanced dataset on your hands. In practice, roughly half of all email sent is spam by most estimates. This means the classifier was either examining a set of emails that had already been curated, or it’s simply not working properly.
There are dozens of ways to manage imbalanced datasets, but I’ll focus on explaining the most popular ones. These can be used to deal with 99.9% of data imbalance issues you may encounter.
But what about the remaining 0.1%? How do I deal with those?
Most people will never encounter such specific examples of imbalanced data. If you do, then you’ll need years of experience to build a custom solution. A simple article, no matter how brilliantly written, wouldn’t help you at all.
Undersampling means reducing the size of the majority class. It’s a fairly simple method of handling imbalanced data, but it works great for binary classification. The most effective way to undersample is to keep all examples from the rare class and randomly choose examples from the abundant class.
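If you want to see what that looks like in practice, here’s a minimal sketch using scikit-learn’s resample helper. The array shapes and class counts are made up purely for illustration:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 990 abundant-class samples (0) vs. 10 rare (1).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 990 + [1] * 10)

# Keep every rare-class example; randomly pick an equal number of
# abundant-class examples without replacement.
rare_idx = np.where(y == 1)[0]
abundant_idx = np.where(y == 0)[0]
kept_abundant_idx = resample(
    abundant_idx, replace=False, n_samples=len(rare_idx), random_state=42
)

keep = np.concatenate([rare_idx, kept_abundant_idx])
X_balanced, y_balanced = X[keep], y[keep]
print(np.bincount(y_balanced))  # [10 10] -- perfectly balanced
```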
Oversampling works the other way around. You keep all samples from the abundant class but add several new copies of the entire rare class to the dataset. Similar to undersampling, oversampling is most effective when dealing with binary classification problems.
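Here’s the matching sketch for random oversampling, again on a made-up toy dataset: the rare class is drawn with replacement until it matches the abundant class in size.

```python
import numpy as np
from sklearn.utils import resample

# Same toy setup: 990 abundant-class samples (0) vs. 10 rare (1).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 990 + [1] * 10)

X_rare, X_abundant = X[y == 1], X[y == 0]

# Duplicate rare-class rows (sampling *with* replacement) until the
# two classes are the same size.
X_rare_upsampled = resample(
    X_rare, replace=True, n_samples=len(X_abundant), random_state=42
)

X_balanced = np.vstack([X_abundant, X_rare_upsampled])
y_balanced = np.array([0] * len(X_abundant) + [1] * len(X_rare_upsampled))
print(np.bincount(y_balanced))  # [990 990] -- balanced
```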
Both methods give you a new dataset that’s balanced and easier to work with. Because the rare “bad” samples now make up a much larger share of what it sees, your classifier can learn their patterns and better predict future outcomes.
I am not a huge fan of the next technique, class weighting, because I find it overly complicated for no good reason. However, many folks would want to put me in data science jail if I didn’t introduce it here, so let’s take a quick gander.
A parameter called class_weight is present in almost every popular classifier library. By default it’s unset, meaning every class carries the same weight. Set it to “balanced” (in scikit-learn, for instance) and the model assigns each class a weight inversely proportional to its frequency. Combined with the log loss function as the cost basis, this lets you “punish” your classifier much more severely when it misclassifies a rare sample than when it misclassifies an abundant one.
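Here’s how that looks in scikit-learn, one library where the parameter exists; the “balanced” mode computes the inverse-frequency weights for you, and the synthetic dataset below is just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem where class 1 is roughly 5% of the samples.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)

# class_weight="balanced" sets each class's weight to
# n_samples / (n_classes * class_count), i.e. inversely proportional
# to how often the class appears. Logistic regression minimizes log
# loss, so with this ~95/5 split, mistakes on the rare class cost
# roughly 19x more than mistakes on the abundant one.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```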
While the techniques above dealt with the data presented to the classifier, this one modifies the classifier itself. First, you need to decide what needs to be changed and why. This is done using evaluation metrics.
Simply put, evaluation metrics are what we use to measure the quality of a machine learning model. While many exist, the four I believe to be absolutely essential are accuracy, precision, recall, and the F1 score.
Let’s use another hypothetical example. Our model is trying to predict whether Oprah is ever going to become the God-Empress of our planet. Hopefully, given how you built it, the model says NO every time. Say that only 0.5% of the samples in our dataset are labeled YES.
Accuracy is the prime evaluation criterion, so that’s the first thing we’ll use to evaluate our model.
A = (TP + TN) / (TP + TN + FP + FN)
TP - True Positive
TN - True Negative
FP - False Positive
FN - False Negative
By applying the formula above, you get 99.5%: the model always says NO, and NO is the correct label 99.5% of the time. Your model is 99.5% accurate. Was this helpful? Not in the slightest. You still have an imbalanced dataset on your hands, and the high accuracy hides the fact that the model never catches a single positive. It doesn’t help you predict anything.
Precision comes next.
P = TP / (TP + FP)
It’s used to check how many of the predicted positives were true positives. Since we never predicted a single positive, both TP and FP are zero and the formula divides zero by zero. Essentially, it confirms our model isn’t doing anything useful, but doesn’t help us beyond that.
On to recall.
R = TP / (TP + FN)
We use recall to check how many of the actual positives the model correctly identified. Since we got no true positives, recall is 0.
Next up we have the F1 score, a personal favorite of mine. If you wondered why I bothered writing out the examples for precision and recall above, this is why: the F1 score is the harmonic mean of precision and recall, and it always falls between 0 and 1.
F1 = 2 * (P * R) / (P + R)
When the formula is applied to our Oprah problem, we get 0. Because F1 balances precision and recall in a single number, a score of 0 tells us our model is completely useless for this particular set of data, and that the right way to handle our imbalanced dataset problem is to build a different model.
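To tie the whole walk-through together, here’s a small sketch that reproduces the Oprah numbers with scikit-learn: an always-NO classifier scored on 1,000 samples, 5 of which (0.5%) are labeled YES.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# 1,000 ground-truth labels, only 5 of them (0.5%) positive.
y_true = np.array([0] * 995 + [1] * 5)
# The model predicts NO (0) every single time.
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.995
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- no positives predicted
print(recall_score(y_true, y_pred))                      # 0.0 -- every positive missed
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0 -- useless model
```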
If you’re a professional data analyst and you encounter a real-life example of imbalanced data, it’s usually going to be worth looking into.
A great example is encountered in finance. If you work for a bank and are trying to predict credit card fraud, you’ll encounter imbalanced data sets. The competition between fraudsters and machine learning models has been going on for a long time and each side keeps finding new ways to “outsmart” the other.
In health care, predictive models are being used to better diagnose various diseases and have been saving lives for the past few decades. As you can imagine, the rarest diseases create imbalanced datasets, and classifiers learn from those to help doctors with diagnoses that would rarely come to mind.
Finally, we have so many examples of imbalanced datasets in marketing that I sometimes believe some solutions were created solely to help marketers sell more services. Customer churn, refund requests, and even legal action taken against companies can all create imbalanced datasets, and getting ahead of those is invaluable to top companies.
By the time I finished writing this article, I had deleted my original conclusion (yes, I always write that part first) and decided to write a new one. I did so because I was sure this would be a short and sweet piece that’s easy to both read and write.
However, I found myself constantly digressing trying to explain different machine learning principles, thinking of new examples, and the like. I even learned a few new things while fact-checking myself when making edits.
I consider myself a machine learning veteran, yet every time I write an article like this, I remind myself what originally got me excited about it. It’s a bottomless pit (in a good way) of theories, problems, and solutions - all created for the betterment of mankind.