Have you worked on a machine learning classification problem in the real world? If so, you have probably run into the imbalanced data problem. Imbalanced data means the classes we want to predict are disproportionately represented. Classes that make up a large proportion of the data are called majority classes; those that make up a smaller proportion are minority classes. For example, suppose we want to use machine learning models to detect credit card fraud, and fraudulent activity accounts for approximately 0.1% of millions of transactions. The overwhelming majority of regular transactions makes it hard for the learning algorithm to identify the patterns of fraudulent activity.
Let’s begin with a simple case using the breast cancer data provided by scikit-learn. As shown below, the data is fairly balanced: 212 samples in class 0 and 357 in class 1.
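A quick way to check the class balance is to count the labels directly. The following is a minimal sketch, assuming the dataset is loaded with `load_breast_cancer`; in scikit-learn's copy, label 0 is malignant and label 1 is benign. The variable names are illustrative, not taken from the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Load features and labels (0 = malignant, 1 = benign in scikit-learn's copy).
X, y = load_breast_cancer(return_X_y=True)

# Count how many samples fall in each class.
counts = Counter(y.tolist())
total = len(y)

for label in sorted(counts):
    share = counts[label] / total
    print(f"class {label}: {counts[label]} samples ({share:.1%})")
```

Neither class dwarfs the other here, which is why this data serves as the balanced benchmark.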
Then we can build a simple decision tree model and observe its performance. For class 1, the precision shows that 81% of the samples the model predicts as class 1 are indeed class 1, and the recall shows a 97% true positive rate, i.e. 97% of actual class 1 samples are captured by the model.
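The benchmark could be reproduced roughly as follows. This is a sketch under assumptions: the one-third test split, the stratification, and the seed are guesses, and `min_samples_leaf=30` is borrowed from the classifier settings shown later in the article. The exact precision and recall will therefore differ from the figures quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumed split: one third held out for testing, class ratio preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# A deliberately simple tree; min_samples_leaf=30 limits its depth.
clf = DecisionTreeClassifier(random_state=0, min_samples_leaf=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall and F1 in one table.
print(classification_report(y_test, y_pred))

# The two numbers discussed in the text, for class 1 only.
precision_1 = precision_score(y_test, y_pred, pos_label=1)
recall_1 = recall_score(y_test, y_pred, pos_label=1)
print(f"class 1 precision={precision_1:.2f} recall={recall_1:.2f}")
```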
For the experiment, we create imbalanced data by sampling with replacement 10,000 records from class 0 and 100 records from class 1. The proportion of the minority class is about 1%.
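One way to build such a dataset is with `sklearn.utils.resample`. This is a minimal sketch, assuming sampling with replacement (class 0 has only 212 original rows, so 10,000 draws must repeat rows); the names `X_imb` and `y_imb` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

# Draw 10,000 rows from class 0 and 100 rows from class 1, with replacement.
X_major = resample(X[y == 0], replace=True, n_samples=10_000, random_state=0)
X_minor = resample(X[y == 1], replace=True, n_samples=100, random_state=0)

# Stack the two groups and rebuild the matching label vector.
X_imb = np.vstack([X_major, X_minor])
y_imb = np.concatenate([np.zeros(10_000, dtype=int), np.ones(100, dtype=int)])

# Because labels are 0/1, the mean is the minority share.
print(f"minority share: {y_imb.mean():.2%}")
```

Since rows are drawn with replacement, identical rows can end up in both the training and the test portion of a later split, so scores on this data tend to be optimistic.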
Then we apply the same simple decision tree model to this imbalanced data. The true positive rate for class 1 drops from 97% to 33%. How do we address the issue?
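The drop in true positive rate can be inspected through the confusion matrix. The sketch below rebuilds the imbalanced data and fits the same tree; the split and seeds are assumptions, so the recall it prints will not necessarily be 33%. Accuracy is printed alongside only to show that it can stay high while minority recall is poor.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

# Imbalanced data: 10,000 draws of class 0 and 100 draws of class 1.
X_imb = np.vstack([
    resample(X[y == 0], replace=True, n_samples=10_000, random_state=0),
    resample(X[y == 1], replace=True, n_samples=100, random_state=0),
])
y_imb = np.concatenate([np.zeros(10_000, dtype=int), np.ones(100, dtype=int)])

# Assumed split: one third held out, class ratio preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.33, random_state=0, stratify=y_imb
)

clf = DecisionTreeClassifier(random_state=0, min_samples_leaf=30)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(
    y_test, clf.predict(X_test), labels=[0, 1]
).ravel()

recall_1 = tp / (tp + fn)                    # true positive rate for class 1
accuracy = (tp + tn) / (tp + tn + fp + fn)   # dominated by the majority class
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"class 1 recall={recall_1:.2f} accuracy={accuracy:.3f}")
```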
Scikit-learn provides a simple way to fix this issue. By specifying
class_weight='balanced'
for DecisionTreeClassifier, the classifier automatically sets the weight of each class inversely proportional to its frequency in the training data. Our experiment shows that using balanced class weights improves recall from 33% to 96%, but it incurs many false positives, and precision decreases from 100% to 36%.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0, min_samples_leaf=30, class_weight='balanced')
clf = clf.fit(X_train, y_train)
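To see what the balanced setting does to a 10,000 / 100 class mix, the weights can be computed directly with `compute_class_weight`. This sketch works on the label vector alone; the manual formula, `n_samples / (n_classes * np.bincount(y))`, is the one scikit-learn documents for the `'balanced'` mode.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels only: 10,000 majority (0) and 100 minority (1) samples.
y_imb = np.concatenate([np.zeros(10_000, dtype=int), np.ones(100, dtype=int)])
classes = np.array([0, 1])

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imb)

# The same weights from the documented formula.
manual = len(y_imb) / (len(classes) * np.bincount(y_imb))

print(dict(zip(classes.tolist(), weights.tolist())))
```

With this mix, each minority sample counts 100 times as much as a majority sample (50.5 versus 0.505), which is consistent with recall rising and precision falling.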
Another approach is up-sampling. This means we keep all of the majority class and randomly sample with replacement from the minority class to increase its proportion. Here, we randomly select 1,000 records from class 1, which increases the proportion of the minority class to 13%.
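A possible implementation of this step is sketched below. It assumes the data is split first and only the training portion is up-sampled, so the test set keeps its original 1% minority share. The one-third test split is also an assumption; it leaves roughly 6,700 majority rows in training, which is consistent with the 13% figure (1,000 out of about 7,700).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

# Imbalanced data: 10,000 draws of class 0 and 100 draws of class 1.
X_imb = np.vstack([
    resample(X[y == 0], replace=True, n_samples=10_000, random_state=0),
    resample(X[y == 1], replace=True, n_samples=100, random_state=0),
])
y_imb = np.concatenate([np.zeros(10_000, dtype=int), np.ones(100, dtype=int)])

# Split before up-sampling so the test set is left untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.33, random_state=0, stratify=y_imb
)

# Keep every majority row; draw 1,000 minority rows with replacement.
X_train_major = X_train[y_train == 0]
X_train_minor_up = resample(
    X_train[y_train == 1], replace=True, n_samples=1000, random_state=0
)

X_train_up = np.vstack([X_train_major, X_train_minor_up])
y_train_up = np.concatenate([
    np.zeros(len(X_train_major), dtype=int),
    np.ones(1000, dtype=int),
])

print(f"minority share after up-sampling: {y_train_up.mean():.1%}")
```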
Again we train our simple decision tree model, this time on the up-sampled data. Our experiment shows that up-sampling produces a true positive rate comparable to that of method 1 while improving precision from 36% to 62%.
In this article, we 1) explained the issue of imbalanced data: the simple decision tree model works well with the balanced breast cancer data, yet does not work well with the imbalanced data; and 2) introduced two methods to address the issue. One method is class weighting, and the other is up-sampling. The comparison between our benchmark and the two methods is listed below. Note that for both methods we can further improve performance by fine-tuning the class weights and the minority class ratio.
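A side-by-side comparison of this kind could be generated with a script along the following lines. It is a sketch, not the original experiment: the split, the seeds, and the helper `evaluate` are assumptions, so the printed precision and recall will differ from the percentages quoted in the article.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

# Imbalanced data: 10,000 draws of class 0 and 100 draws of class 1.
X_imb = np.vstack([
    resample(X[y == 0], replace=True, n_samples=10_000, random_state=0),
    resample(X[y == 1], replace=True, n_samples=100, random_state=0),
])
y_imb = np.concatenate([np.zeros(10_000, dtype=int), np.ones(100, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.33, random_state=0, stratify=y_imb
)

# Up-sampled training set: all majority rows plus 1,000 minority draws.
X_major = X_train[y_train == 0]
X_minor_up = resample(
    X_train[y_train == 1], replace=True, n_samples=1000, random_state=0
)
X_train_up = np.vstack([X_major, X_minor_up])
y_train_up = np.concatenate([
    np.zeros(len(X_major), dtype=int),
    np.ones(1000, dtype=int),
])


def evaluate(clf, X_fit, y_fit):
    """Hypothetical helper: fit on the given data, score class 1 on the test set."""
    y_pred = clf.fit(X_fit, y_fit).predict(X_test)
    precision = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
    recall = recall_score(y_test, y_pred, pos_label=1)
    return precision, recall


def make_tree(**kwargs):
    """Same simple tree everywhere; only class_weight varies."""
    return DecisionTreeClassifier(random_state=0, min_samples_leaf=30, **kwargs)


results = {
    "benchmark": evaluate(make_tree(), X_train, y_train),
    "class weight": evaluate(make_tree(class_weight="balanced"), X_train, y_train),
    "up-sampling": evaluate(make_tree(), X_train_up, y_train_up),
}

for name, (precision, recall) in results.items():
    print(f"{name:<14} precision={precision:.2f} recall={recall:.2f}")
```

All three models are scored on the same untouched test set, so differences in the printed rows come from the training strategy alone.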