What is Data Imbalance in Machine Learning?

by Modzy · June 2nd, 2021

Too Long; Didn't Read

Data imbalance is a common problem in machine learning classification in which the training dataset contains a disproportionate ratio of samples in each class. Real-world scenarios that suffer from class imbalance include threat detection, medical diagnosis, and spam filtering. There are many techniques for handling class imbalance during training, such as data-driven approaches (resampling) and algorithmic approaches (ensemble models). At Modzy, we’re conscious of this challenge and have procedures built into our model training processes to minimize the impact of data imbalance.


Data imbalance, or imbalanced classes, is a common problem in machine learning classification where the training dataset contains a disproportionate ratio of samples in each class. Examples of real-world scenarios that suffer from class imbalance include threat detection, medical diagnosis, and spam filtering.

Class imbalance can make training effective machine learning models difficult, especially when there aren’t enough samples belonging to the class of interest. In fraud detection, for example, the number of fraudulent transactions is negligible compared to the number of lawful transactions, making it difficult to train a machine learning model because the training dataset contains very little information about fraud.

However, there are many techniques for handling class imbalance during training, such as data-driven approaches (resampling) and algorithmic approaches (ensemble models). At Modzy, we’re conscious of this challenge and have procedures built into our model training processes to minimize the impact of data imbalance.

What You Need to Know

The most popular data-driven techniques for dealing with imbalanced classes are undersampling and oversampling. Undersampling means selecting a representative subset of the majority class and using only that subset during training (i.e., the rest of the majority-class data is left out of training).

Oversampling involves adding copies of samples from the minority class to the training dataset. Originally, these sampling techniques were applied at random; however, statistical techniques have since been developed to better represent the underlying class distributions when selecting or discarding samples.
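As a simple illustration, the sketch below (ours, not from the article; the toy data and variable names are assumptions) shows random undersampling and random oversampling of an imbalanced binary dataset using NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 majority-class samples (label 0), 10 minority-class samples (label 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Random undersampling: keep only as many majority samples as there are minority samples.
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep_maj, min_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Random oversampling: duplicate minority samples until the two classes are balanced.
extra_min = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
over_idx = np.concatenate([maj_idx, min_idx, extra_min])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under), np.bincount(y_over))  # [10 10] [90 90]
```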

One example is undersampling with Tomek Links [1], which are pairs of very close samples belonging to opposing classes. Removing the majority-class sample from each pair increases the separation between the two classes, which in turn helps the training process.
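A minimal sketch of Tomek Link undersampling, assuming the third-party imbalanced-learn package is available (the article does not name an implementation):

```python
import numpy as np
from imblearn.under_sampling import TomekLinks  # assumes imbalanced-learn is installed

rng = np.random.default_rng(0)
# Two overlapping classes so that some Tomek Links actually exist.
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)), rng.normal(1.0, 1.0, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

tl = TomekLinks()                   # drops the majority-class member of each Tomek pair
X_tl, y_tl = tl.fit_resample(X, y)  # returns the cleaned training set
print(np.bincount(y), "->", np.bincount(y_tl))
```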

Another example would be to oversample using a technique called SMOTE (Synthetic Minority Oversampling Technique) [2]. A random sample is first picked from the minority class, then a number of its neighboring samples are found. Synthetic samples are then added between the chosen sample and its neighbors. This way, new information is being added to the minority class, rather than simply copying existing information.
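To make the interpolation step concrete, here is a simplified SMOTE-like sketch in plain NumPy; the function name and parameters are ours, and a full implementation (for example, the one in imbalanced-learn) handles additional details:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                    # pick a random minority sample
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]          # its k nearest minority neighbors
        j = rng.choice(neighbors)                       # choose one neighbor at random
        lam = rng.random()                              # random point on the connecting segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(10, 2))   # toy minority-class samples
X_new = smote_like(X_min, n_new=40)                     # 40 synthetic minority samples
```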

Although the data-driven techniques discussed above help balance the training dataset for more effective training, they need to be used cautiously because the resulting training dataset is no longer representative of the real world. Potentially useful information about each class’s proportion is lost when these techniques are applied.

Algorithmic ensembling approaches aim to remedy this issue by modifying machine learning algorithms to better handle class imbalance. They involve constructing several models from the original training dataset, or from subsets of it, and then aggregating their predictions.

One example ensemble technique is Bagging (Bootstrap Aggregating) [3], which first constructs k different balanced training subsets from the original training dataset, trains k different models on those subsets, and finally aggregates the models’ predictions. This technique keeps the original training dataset intact while improving model performance.
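Below is a hedged sketch of this balanced-bagging recipe using scikit-learn decision trees as base models (our choice of library and base learner; the article does not prescribe either):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumes scikit-learn is installed

def balanced_bagging_fit(X, y, k=10, seed=0):
    """Train k classifiers, each on a bootstrap subset balanced by undersampling the majority class."""
    rng = np.random.default_rng(seed)
    maj_idx, min_idx = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    models = []
    for _ in range(k):
        maj_sample = rng.choice(maj_idx, size=len(min_idx), replace=True)
        min_sample = rng.choice(min_idx, size=len(min_idx), replace=True)
        idx = np.concatenate([maj_sample, min_sample])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def balanced_bagging_predict(models, X):
    """Aggregate the k models by majority vote."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy usage on an imbalanced dataset.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
preds = balanced_bagging_predict(balanced_bagging_fit(X, y), X)
```

Ready-made variants of this idea also exist in libraries such as imbalanced-learn, which provides balanced bagging ensembles out of the box.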

Our Approach to Data Imbalance

Data preprocessing is a critical part of the model training process at Modzy because of the adverse effects it can have if done improperly, or not at all. We choose how to handle data imbalance in classification problems carefully and exhaustively to ensure the resulting model is not biased toward any particular class.

What This Means for You

In many real-world scenarios, the data behind machine learning classification tasks (threat/fraud detection, tumor classification, etc.) is naturally highly imbalanced toward one class. At Modzy, we develop our models using techniques that reduce the adverse effects of data imbalance while ensuring all class distributions are accurately represented during training.

References:

[1] Elhassan, T., and M. Aljurf. “Classification of Imbalance Data using Tomek Link (T-Link) Combined with Random Under-sampling (RUS) as a Data Reduction Method.” (2016).
[2] Chawla, Nitesh V., et al. “SMOTE: Synthetic Minority Over-sampling Technique.” Journal of Artificial Intelligence Research 16 (2002): 321–357.
[3] Breiman, Leo. “Bagging Predictors.” Machine Learning 24.2 (1996): 123–140.