Application of Synthetic Minority Over-sampling Technique (SMOTe) for Imbalanced Data-sets

In Data Science, imbalanced datasets are no surprise. If a dataset intended for a classification problem such as Sentiment Analysis, Medical Imaging or another problem in Discrete Predictive Analytics (for example, Flight Delay Prediction) has an unequal number of instances (samples or datapoints) for its different classes, the dataset is said to be imbalanced: there is a large difference between the number of instances belonging to each class. The class with comparatively fewer instances is known as the minority class, and the class with comparatively more instances is known as the majority class. An example of an imbalanced dataset is given below:

Here, there are 2 classes with labels: 0 and 1 with Gross Imbalance

Training a Machine Learning Model on such an imbalanced dataset often causes the model to develop a bias towards the majority class.
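To see why accuracy alone can hide this bias, here is a minimal sketch with made-up counts (950 majority and 50 minority labels, chosen purely for illustration). A "model" that always predicts the majority class still scores 95% accuracy while never detecting a single minority instance.

```python
# Hypothetical label counts, chosen only for illustration
y_true = [0] * 950 + [1] * 50

# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: detected minority / all minority
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / y_true.count(1)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```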

To tackle the issue of class imbalance, Synthetic Minority Over-sampling Technique (SMOTe) was introduced by Chawla et al. [3] in 2002.

Brief description on SMOTe
  1. SMOTe is a technique based on nearest neighbours judged by Euclidean Distance between datapoints in feature space.
  2. The percentage of Over-sampling indicates the number of synthetic samples to be created; here it is always a multiple of 100. If the percentage of Over-sampling is 100, then one new sample is created for each minority instance, so the number of minority class instances doubles. Similarly, if the percentage of Over-sampling is 200, the total number of minority class samples triples.
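The arithmetic can be sketched as follows. The function name and the example count of 50 minority instances are illustrative assumptions, not part of the original description.

```python
def minority_count_after_smote(n_minority, percentage):
    """Minority class size after over-sampling by `percentage` (a multiple of 100)."""
    if percentage % 100 != 0:
        raise ValueError("percentage is assumed to be a multiple of 100")
    synthetic_per_instance = percentage // 100
    return n_minority + n_minority * synthetic_per_instance

print(minority_count_after_smote(50, 100))  # 100 (doubled)
print(minority_count_after_smote(50, 200))  # 150 (tripled)
```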

In SMOTe,

  • For each minority instance, the k nearest neighbours belonging to the same class are found, where k = (percentage of Over-sampling)/100.
  • The differences between the feature vector of the considered instance and the feature vectors of its k nearest neighbours are computed, giving k difference vectors.
  • The k difference vectors are each multiplied by a random number between 0 and 1 (excluding 0 and 1).
  • Now, the difference vectors, after being multiplied by random numbers, are added to the feature vector of the considered instance (original minority instance) at each iteration.
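As a small illustration of these steps for a single instance, the sketch below builds one synthetic sample with k = 1. The toy points, the seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for repeatability

# Toy minority-class points in a 2-D feature space (made-up values)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 4.0]])
x = minority[0]

# Euclidean distance from x to every other minority point
others = minority[1:]
distances = np.linalg.norm(others - x, axis=1)
neighbour = others[np.argmin(distances)]  # here the point (2.0, 1.5)

# Difference vector scaled by a random weight in (0, 1)
weight = rng.uniform(0.0, 1.0)
while weight == 0.0:  # uniform() may return exactly 0, which is excluded
    weight = rng.uniform(0.0, 1.0)
synthetic = x + weight * (neighbour - x)

print(neighbour)
print(synthetic)  # lies on the segment between x and neighbour
```

Because the weight is strictly between 0 and 1, the synthetic sample always falls between the original instance and its neighbour rather than on either of them.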

The implementation of SMOTe in Python from scratch follows:

import random
import numpy as np

def nearest_neighbour(X, x):
    # Returns the difference vector from x to its nearest neighbour in X,
    # scaled by a random weight
    # Squared Euclidean distance from x to every row of X
    distances = np.sum((X - x) ** 2, axis=1).astype(float)
    # Rows identical to x are excluded so that x is never its own neighbour
    distances[np.all(X == x, axis=1)] = np.inf
    nearest = X[np.argmin(distances)]
    # Random weight between 0 and 1 (0 is excluded)
    weight = random.random()
    while weight == 0:
        weight = random.random()
    additive = (nearest - x) * weight
    return additive

def SMOTE_100(X):
    # 100% Over-sampling: one synthetic sample per minority instance
    new = np.empty(X.shape, dtype=float)
    for i in range(X.shape[0]):
        new[i] = X[i] + nearest_neighbour(X, X[i])
    return new # the synthetic samples created by SMOTe

Application of SMOTe in practice

Let us consider the Adult Census Income Prediction Dataset from UCI containing 48,842 instances and 14 attributes/features.

Data-preprocessing with Python Implementation:
  1. Label Encoding is done for categorical (non-numeric) features mentioned in Table 1 (given below) and the label, income.
  2. Feature Selection is done based on the Feature Importance Scores given by an Extra Trees Classifier fitted on the whole dataset (shown in Table 1). As race and native-country receive the lowest Feature Importance Scores, these 2 features are excluded from Model Development.
  3. One-Hot Encoding is done for Categorical Features having more than 2 categories. In One-Hot Encoding, a categorical feature is split into sub-features, one per category, each taking the binary values 0/1. Here, the categorical features workclass, education, marital status, occupation and relationship are One-Hot Encoded. As sex has only 2 categories (male and female), it is not One-Hot Encoded, so as not to add dimensions needlessly (the curse of dimensionality).
Table 1
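
The feature-importance step could be sketched roughly as below. The Adult data and the scores of Table 1 are not reproduced here: the synthetic data from make_classification, the number of features and the random_state values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# One importance score per feature; the scores sum to 1
importances = forest.feature_importances_

# Indices of the two lowest-scoring features, candidates for exclusion
lowest_two = np.argsort(importances)[:2]
print(importances)
print(lowest_two)
```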

Implementing One-Hot Encoding in Python after Feature Selection:

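import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Label Encoding and Feature Selection is over ....
# 1. Loading the modified dataset after Label Encoding
df = pd.read_csv('adult.csv')
# Loading of Selected Features into X
X = df.iloc[:,[0,1,2,3,4,5,6,7,9,10,11,12]].values
# Loading of the Label into y
y = df.iloc[:,14].values
# 2. One Hot Encoding ....
# Recent scikit-learn releases no longer accept the categorical_features
# argument of OneHotEncoder, so the columns are chosen with a ColumnTransformer
onehotencoder = ColumnTransformer(
    [('onehot', OneHotEncoder(), [1,3,5,6,7])],
    remainder = 'passthrough', sparse_threshold = 0)
# sparse_threshold = 0 makes the result a dense array
X = onehotencoder.fit_transform(X)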

Now, the class label in this problem assumes only 2 values, i.e., there are 2 classes. So, it is a Binary Classification Problem.

Class Distribution Visualization
# Getting the no. of instances with Label 0
n_class_0 = df[df['income']==0].shape[0]
# Getting the no. of instances with Label 1
n_class_1 = df[df['income']==1].shape[0]
# Bar Visualization of Class Distribution
import matplotlib.pyplot as plt # required library
class_names = ['0', '1']
# A separate name is used so that the label vector y is not overwritten
class_counts = np.array([n_class_0, n_class_1])
plt.bar(class_names, class_counts)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
plt.show()

Class Distribution in the Adult Dataset

So, in the given dataset, there is Gross Imbalance between the 2 classes, with Class Label ‘1’ as the Minority and Class Label ‘0’ as the Majority.

Now, there are 2 possible approaches:

  1. Shuffling and Splitting the Dataset into Training and Validation Sets and applying SMOTe on the Training Dataset. (1st Approach)
  2. Applying SMOTe on the given dataset as a whole and then Shuffle-Splitting the Dataset into Training and Validation Sets. (2nd Approach)

Many web sources, such as Stack Overflow answers and Personal Blogs, describe the 2nd Approach as a wrong method of Over-sampling. In particular, Nick Becker’s Personal Blog [1] calls the 2nd Approach wrong, giving the following reason:

Applying SMOTe on the whole dataset creates similar instances, as the algorithm is based on k-nearest neighbour theory. For this reason, Splitting after applying SMOTe on the given dataset results in information leakage from the Validation Set to the Training Set, causing the classifier or Machine Learning Model to over-estimate its accuracy and other performance measures.
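
The concern quoted above can be made concrete with a small sketch. It assumes toy minority points, one synthetic sample interpolated from each point towards its nearest neighbour, and a random 80/20 split; all names and values are illustrative. It counts how many synthetic samples in the validation part were built from a point that ended up in the training part.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, only for repeatability

# Toy minority-class points (made-up data)
originals = rng.normal(size=(100, 2))

# One synthetic sample per original, interpolated towards its nearest neighbour
diffs = originals[:, None, :] - originals[None, :, :]
dist = np.linalg.norm(diffs, axis=2)
np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
nearest = np.argmin(dist, axis=1)
weights = rng.uniform(0.0, 1.0, size=(100, 1))
synthetic = originals + weights * (originals[nearest] - originals)

# Over-sample first, then shuffle-split 80/20 (the 2nd Approach)
n_total = 200                             # 100 originals + 100 synthetic
order = rng.permutation(n_total)
in_train = np.zeros(n_total, dtype=bool)
in_train[order[:160]] = True

# Synthetic sample i has index 100 + i; its parents are originals i and nearest[i]
synthetic_in_val = ~in_train[100:]
parent_in_train = in_train[:100] | in_train[nearest]
leaked = np.sum(synthetic_in_val & parent_in_train)
print(leaked, "of", synthetic_in_val.sum(),
      "validation synthetic samples have a parent in the training part")
```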

He has also demonstrated this with a practical example on a real dataset, using the imbalanced-learn toolbox [2] to apply SMOTe. Truly speaking, I have never been able to figure out the documentation of the toolbox properly, so I prefer implementing the SMOTe Algorithm from scratch as demonstrated above. In this article, I am going to demonstrate that the 2nd Approach is NOT wrong!

Let’s first follow the 1st Approach, as it is the widely accepted one.

In order to demonstrate that the 2nd Approach is not wrong, I will Shuffle-Split the whole dataset into a Train-Validation Set and a Test Set. The Test Set will be kept separate as the unknown set of instances. The implementation follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=1234)
# X_train and y_train are the Train-Validation Set
# X_test and y_test are the Test Set separated out

  1. Now, in the Train-Validation Set, 1st and 2nd Approaches will be applied case-wise.
  2. Then, Performance Analysis will be done on the same separated set of unknown instances (Test Set) for both the models (developed following 1st Approach and 2nd Approach).

Following 1st Approach of using SMOTe after Splitting

=> Splitting the Train-Validation Set into Training and Validation Sets. The implementation follows:

X_train, X_v, y_train, y_v = train_test_split(X_train, y_train,
    test_size=0.2, random_state=2341)
# X_train and y_train are now the Training Set
# X_v and y_v are the Validation Set

=> Applying SMOTe only on the Training Set

# 1. Getting the number of Minority Class Instances in Training Set
import numpy as np # required library
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]
# 2. Storing the minority class instances separately
# (rows are taken from X_train so that they line up with y_train)
x1 = X_train[y_train == 1].astype(float)
# 3. Applying 100% SMOTe
sampled_instances = SMOTE_100(x1)
# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train,sampled_instances), axis = 0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train,y_sampled_instances), axis=0)
# X_f and y_f are the Training Set Features and Labels respectively

Model Training using Gradient Boosting Classifier

Gradient Boosting Classifier is used for Training the Machine Learning Model. Grid-Search is used on Gradient Boosting Classifier for obtaining the best set of hyper-parameters which are the number of estimators and max_depth.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators':[100,150,200,250,300,350,400,450,500],
              'max_depth':[3,4,5]}
clf= GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid = parameters, estimator = clf,
                           verbose = 3)
grid_search_1 = grid_search.fit(X_f,y_f)

So, the Trained Machine Learning Model following 1st Approach is embedded in grid_search_1.
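
For reference, GridSearchCV chooses the hyper-parameters through its own internal cross-validation on the data passed to fit, and the selected values can be read from best_params_. Below is a minimal sketch on synthetic stand-in data with a deliberately small grid; both are assumptions made to keep the example fast, and the names X_demo, y_demo, small_grid and search are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

# Deliberately small grid so that the sketch runs quickly
small_grid = {'n_estimators': [20, 40], 'max_depth': [2, 3]}
search = GridSearchCV(estimator=GradientBoostingClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)  # best combination found on the grid
print(search.best_score_)   # mean cross-validated accuracy of that combination
```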

Following 2nd Approach of using SMOTe before Splitting

=> Applying SMOTe on the whole Train-Validation Set:

# X_train and y_train must again hold the whole Train-Validation Set, so the
# first split is repeated here (the 1st Approach overwrote these names)
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=1234)
# 1. Getting the number of Minority Class Instances in Train-Validation Set
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]
# 2. Storing the minority class instances separately
# (rows are taken from X_train so that they line up with y_train)
x1 = X_train[y_train == 1].astype(float)
# 3. Applying 100% SMOTe
sampled_instances = SMOTE_100(x1)
# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train,sampled_instances), axis = 0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train,y_sampled_instances), axis=0)
# X_f and y_f are the Train-Validation Set Features and Labels respectively

=> Splitting the over-sampled Train-Validation Set into Training and Validation Sets. The implementation follows:

X_train, X_v, y_train, y_v = train_test_split(X_f, y_f,
    test_size=0.2, random_state=9999)
# X_train and y_train are the Training Set
# X_v and y_v are the Validation Set

Model Training using Gradient Boosting Classifier

Similarly, Grid Search is applied on Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators':[100,150,200,250,300,350,400,450,500],
              'max_depth':[3,4,5]}
clf= GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid = parameters, estimator = clf,
                           verbose = 3)
grid_search_2 = grid_search.fit(X_train,y_train)

So, the Trained Machine Learning Model following 2nd Approach is embedded in grid_search_2.

PERFORMANCE ANALYSIS AND COMPARISON

The Performance Metrics used for Comparison and Analysis are:

  1. Accuracy on the Test Set (Test Accuracy)
  2. Precision on the Test Set
  3. Recall on the Test Set
  4. F1-Score on the Test Set

Other than these comparison metrics, Training Accuracy (Training Set) and Validation Accuracy (Validation Set) are also computed.
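
For reference, these metrics follow directly from the confusion-matrix counts for the positive class (label 1). The counts in this sketch are hypothetical and are not results from this article.

```python
# Hypothetical confusion-matrix counts for the positive (minority) class
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```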

from sklearn.metrics import classification_report
# MODEL 1 PERFORMANCE ANALYSIS
# (run this part right after training Model 1, before the 2nd Approach
# reuses the names X_f, y_f, X_v and y_v)
# 1. Training Accuracy for Model 1 (following Approach 1)
print(grid_search_1.score(X_f, y_f))
# 2. Validation Accuracy on Validation Set for Model 1
print(grid_search_1.score(X_v, y_v))
# 3. Test Accuracy on Test Set for Model 1
print(grid_search_1.score(X_test, y_test))
# 4. Precision, Recall and F1-Score on the Test Set for Model 1
predictions=grid_search_1.predict(X_test)
print(classification_report(y_test,predictions))
# MODEL 2 PERFORMANCE ANALYSIS
# 5. Training Accuracy for Model 2 (following Approach 2)
print(grid_search_2.score(X_train, y_train))
# 6. Validation Accuracy on Validation Set for Model 2
print(grid_search_2.score(X_v, y_v))
# 7. Test Accuracy on Test Set for Model 2
print(grid_search_2.score(X_test, y_test))
# 8. Precision, Recall and F1-Score on the Test Set for Model 2
predictions=grid_search_2.predict(X_test)
print(classification_report(y_test,predictions))

Training and Validation Set Accuracy for Model 1 and Model 2 are:

  1. Training Accuracy (Model 1): 90.64998262078554%
  2. Training Accuracy (Model 2): 90.92736479956705%
  3. Validation Accuracy (Model 1): 86.87140115163148%
  4. Validation Accuracy (Model 2): 89.33209647495362%

So, the 2nd Approach apparently shows higher Validation Accuracy, but no conclusion can be drawn without testing both models on the same, completely unknown Test Set. A comparison of the 2 models’ performances on the Test Set is shown in Table 2.

Table 2

So, however small the difference may be, Approach 2 is shown to be more successful than Approach 1 on the Test Set, and hence it can’t be called wrong for the reasons given by Nick Becker [1], because:

“Though SMOTe creates similar instances, this property is required not only for Class Imbalance Reduction and Data Augmentation but also for finding the best Training Set for Model Training. If the Training Set is not versatile, how can Model Performance be enhanced? As far as the bleeding of information from the Validation Set to the Training Set is concerned, even if it occurs, it contributes to making the Training Set even better and helps in Robust Machine Learning Model Development, as Approach 2 performed better than Approach 1 on completely unknown instances.”

REFERENCES

[1] https://beckernick.github.io/oversampling-modeling/

[2] https://imbalanced-learn.readthedocs.io/en/stable/

[3] Chawla, Nitesh V., et al. “SMOTE: Synthetic Minority Over-sampling Technique.” Journal of Artificial Intelligence Research 16 (2002): 321–357.

For personal contacts regarding the article or discussions on Machine Learning/Data Mining or any department of Data Science, feel free to reach out to me on LinkedIn

More by Navoneel Chakrabarty
