Application of Synthetic Minority Over-sampling Technique (SMOTe) for Imbalanced Data-sets

by @nc2012

February 8th 2019

In Data Science, imbalanced datasets are no surprise. If a dataset intended for a classification problem such as Sentiment Analysis, Medical Imaging or another Discrete Predictive Analytics task (for example, Flight Delay Prediction) has an unequal number of instances (samples or datapoints) for its different classes, the dataset is said to be imbalanced: there is a large difference between the number of instances belonging to each class. The class with comparatively fewer instances is known as the **minority** class, and the class with comparatively more instances is known as the **majority** class. An example of an imbalanced dataset is given below:

Training a Machine Learning Model on such an imbalanced dataset often causes the model to develop a bias towards the majority class.
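One way to see why this bias matters is to compare against a "model" that always predicts the majority class. The sketch below uses made-up labels (950 majority and 50 minority instances, an assumption for illustration only, not the dataset used later in this article) and shows that such a model can look accurate while never finding a single minority instance.

```python
import numpy as np

# Hypothetical labels: 950 majority (0) and 50 minority (1) instances.
y = np.array([0] * 950 + [1] * 50)

classes, counts = np.unique(y, return_counts=True)
majority_class = classes[np.argmax(counts)]

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.full_like(y, majority_class)

accuracy = float(np.mean(y_pred == y))
minority_recall = float(np.mean(y_pred[y == 1] == 1))

print(dict(zip(classes.tolist(), counts.tolist())))
print("accuracy:", accuracy, "minority recall:", minority_recall)
```

Accuracy alone hides the problem here, which is why precision, recall and F1-Score are also reported later in the article.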

To tackle the issue of class imbalance, Synthetic Minority Over-sampling Technique (SMOTe) was introduced by Chawla et al. [3] in 2002.

Brief description on SMOTe

- SMOTe is based on nearest neighbours, judged by the Euclidean Distance between datapoints in feature space.
- An Over-sampling percentage indicates the number of synthetic samples to be created; here it is always taken to be a multiple of 100. If the Over-sampling percentage is 100, one new sample is created for each minority instance, so the number of minority class instances doubles. Similarly, if the Over-sampling percentage is 200, the total number of minority class samples triples.
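The arithmetic behind the Over-sampling percentage can be written down in a few lines. This is only a sketch: the function name `minority_size_after_smote` is hypothetical, and it assumes, as stated above, that the percentage is a positive multiple of 100.

```python
def minority_size_after_smote(n_minority, percentage):
    """Return (n_synthetic, n_total) for an over-sampling percentage
    that is a positive multiple of 100."""
    if percentage <= 0 or percentage % 100 != 0:
        raise ValueError("percentage must be a positive multiple of 100")
    per_instance = percentage // 100          # synthetic samples per minority instance
    n_synthetic = n_minority * per_instance
    return n_synthetic, n_minority + n_synthetic

for pct in (100, 200, 300):
    print(pct, minority_size_after_smote(50, pct))
```

With 50 minority instances, 100% gives 50 synthetic samples (the class doubles) and 200% gives 100 synthetic samples (the class triples).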

In SMOTe,

- For each minority instance, the k nearest neighbours belonging to the same class are found.
- The differences between the feature vector of the considered instance and the feature vectors of its k nearest neighbours are computed, giving k difference vectors.
- Each of the k difference vectors is multiplied by a random number strictly between 0 and 1.
- Each scaled difference vector is then added to the feature vector of the considered instance (the original minority instance), giving one synthetic sample per iteration.
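The steps above can be traced for a single instance and a single neighbour. The values of `x` and `neighbour` below are made up for illustration; the point is only that the synthetic sample lands somewhere on the line segment between the two.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

x = np.array([1.0, 2.0])          # the minority instance being considered
neighbour = np.array([3.0, 6.0])  # one of its nearest minority neighbours

# Difference vector pointing from the instance to its neighbour
difference = neighbour - x

# Random number strictly between 0 and 1
gap = rng.uniform(0.0, 1.0)
while gap == 0.0:
    gap = rng.uniform(0.0, 1.0)

# Synthetic sample: the instance moved part of the way towards the neighbour
synthetic = x + gap * difference
print(gap, synthetic)
```

Because the random number is strictly between 0 and 1, the synthetic sample never coincides with either of the two original points.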

The Implementation of SMOTe in Python from scratch follows below —

import numpy as np

def nearest_neighbour(X, x):
    """Return one scaled difference vector, shape (1, n_features),
    pointing from x towards its nearest neighbour in X."""
    # Euclidean distance from x to every row of X
    euclidean = np.sqrt(((X - x) ** 2).sum(axis=1))
    # Rows identical to x (including x itself) do not count as neighbours
    euclidean[euclidean == 0] = np.inf
    if not np.isfinite(euclidean).any():
        return np.zeros((1, X.shape[1]))
    nearest = X[np.argmin(euclidean)]
    # Random weight strictly between 0 and 1
    weight = np.random.random()
    while weight == 0:
        weight = np.random.random()
    # Difference vector multiplied by the random weight
    additive = np.multiply(nearest - x, weight).reshape(1, X.shape[1])
    return additive

def SMOTE_100(X):
    """Create one synthetic sample per row of X (100% over-sampling)."""
    new = np.zeros(X.shape)
    k = 0
    for i in range(0, X.shape[0]):
        additive = nearest_neighbour(X, X[i])
        for j in range(0, 1):
            new[k] = X[i] + additive[j]
            k = k + 1
    return new  # the synthetic samples created by SMOTe
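The 100% case can be extended to other multiples of 100. The following is a rough vectorised sketch, not the implementation used for the results in this article. The name `smote_multiple` is hypothetical, and it assumes that the number of neighbours used per instance equals percentage/100, with one synthetic sample per neighbour. The original paper [3] instead picks randomly among a fixed number of nearest neighbours, which this sketch does not do. It also builds the full pairwise distance matrix, so it only suits small minority classes.

```python
import numpy as np

def smote_multiple(X_min, percentage, rng):
    """Create percentage/100 synthetic samples per minority instance.

    Assumption: the number of neighbours used per instance equals
    percentage/100, one synthetic sample per neighbour.
    """
    n_new = percentage // 100
    n, d = X_min.shape
    if n_new < 1 or n_new > n - 1:
        raise ValueError("need 1 <= percentage/100 <= n_minority - 1")
    # Pairwise Euclidean distances; the diagonal is set to inf so that an
    # instance is never picked as its own neighbour.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :n_new]      # shape (n, n_new)
    neighbours = X_min[nearest]                         # shape (n, n_new, d)
    gaps = rng.uniform(0.0, 1.0, size=(n, n_new, 1))    # one random number per pair
    synthetic = X_min[:, None, :] + gaps * (neighbours - X_min[:, None, :])
    return synthetic.reshape(n * n_new, d)

rng = np.random.default_rng(42)
X_min = rng.normal(size=(6, 3))      # made-up minority instances
new_200 = smote_multiple(X_min, 200, rng)
print(new_200.shape)
```

With 6 minority instances and 200% over-sampling, 12 synthetic rows are produced, so the minority class triples once they are appended.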

Let us consider the **Adult Census Income Prediction Dataset** from UCI containing 48,842 instances and 14 attributes/features.

Data-preprocessing with Python Implementation:

- **Label Encoding** is done for the categorical (non-numeric) features mentioned in Table 1 (given below) and for the label, *income*.
- **Feature Selection** is done based on the Feature Importance Scores given by an Extra Trees Classifier on the whole dataset (shown in Table 1). As *race* and *native-country* give the lowest Feature Importance Scores, these 2 features are excluded from Model Development.
- **One-Hot Encoding** is done for Categorical Features having more than 2 categories. In One-Hot Encoding, a categorical feature is split into sub-features, one per category, each taking the binary value 0 or 1. Here, the categorical features *workclass*, *education*, *marital status*, *occupation* and *relationship* are One-Hot Encoded. As *sex* has only 2 categories (*male* and *female*), it is not One-Hot Encoded further, to avoid the curse of dimensionality.
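To make the One-Hot Encoding step concrete, here is a minimal NumPy sketch on a single made-up, label-encoded column with three categories. It is only an illustration of the idea; the column values are hypothetical.

```python
import numpy as np

# Hypothetical label-encoded column with three categories (0, 1, 2).
workclass = np.array([0, 2, 1, 2, 0])

# codes[i] is the position of workclass[i] among the sorted unique categories
categories, codes = np.unique(workclass, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Each row contains exactly one 1, so a single column with three categories becomes three binary sub-features.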

Implementing One-Hot Encoding in Python after Feature Selection ….

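import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Label Encoding and Feature Selection are already done at this point

# 1. Loading the modified dataset after Label Encoding
df = pd.read_csv('adult.csv')

# Loading the Selected Features into X
X = df.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12]].values

# Loading the Label into y
y = df.iloc[:, 14].values

# 2. One-Hot Encoding of columns 1, 3, 5, 6 and 7 of X
# The categorical_features argument of OneHotEncoder was removed in
# scikit-learn 0.22, so ColumnTransformer selects the columns instead.
# sparse_threshold=0 makes fit_transform return a dense array.
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), [1, 3, 5, 6, 7])],
                                  remainder='passthrough', sparse_threshold=0)
X = onehotencoder.fit_transform(X)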

Now, the class label in this problem is **binary**. This means that the class label assumes 2 values i.e., there are 2 classes. So, it is a Binary Classification Problem.

Class Distribution Visualization

import matplotlib.pyplot as plt  # required library

# Getting the no. of instances with Label 0
n_class_0 = df[df['income'] == 0].shape[0]

# Getting the no. of instances with Label 1
n_class_1 = df[df['income'] == 1].shape[0]

# Bar Visualization of Class Distribution
labels = ['0', '1']
# A separate name is used so that the label vector y is not overwritten
class_counts = np.array([n_class_0, n_class_1])
plt.bar(labels, class_counts)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
plt.show()

So, in the given dataset, there is a gross imbalance between the 2 classes, with Class Label ‘1’ as the Minority and Class Label ‘0’ as the Majority.

Now, there are 2 possible approaches:

- Shuffling and Splitting the Dataset into Training and Validation Sets and applying SMOTe on the Training Dataset. (**1st Approach**)
- Applying SMOTe on the given dataset as a whole and then Shuffle-Splitting the Dataset into Training and Validation Sets. (**2nd Approach**)
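The two orderings can be compared on toy data. This is only a sketch: the data are random, `shuffle_split` is a hypothetical helper, and `oversample_100` is a simplified stand-in for 100% SMOTe that interpolates towards a randomly chosen other minority row rather than the nearest neighbour.

```python
import numpy as np

def oversample_100(X_min, rng):
    """Stand-in for 100% SMOTe: one synthetic row per minority row,
    interpolated towards a randomly chosen *other* minority row
    (not the nearest neighbour, to keep the sketch short)."""
    n = X_min.shape[0]
    partner = (np.arange(n) + rng.integers(1, n, size=n)) % n
    gaps = rng.uniform(0.0, 1.0, size=(n, 1))
    return X_min + gaps * (X_min[partner] - X_min)

def shuffle_split(X, y, val_fraction, rng):
    """Shuffle the rows and split them into training and validation parts."""
    idx = rng.permutation(X.shape[0])
    n_val = int(round(val_fraction * X.shape[0]))
    val, train = idx[:n_val], idx[n_val:]
    return X[train], X[val], y[train], y[val]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # made-up features
y = np.array([0] * 80 + [1] * 20)    # made-up labels, 20 minority rows

# 1st Approach: split first, then over-sample the training part only.
X_tr, X_val, y_tr, y_val = shuffle_split(X, y, 0.2, rng)
synthetic = oversample_100(X_tr[y_tr == 1], rng)
X_tr_1 = np.vstack([X_tr, synthetic])
y_tr_1 = np.concatenate([y_tr, np.ones(len(synthetic), dtype=int)])

# 2nd Approach: over-sample everything, then split.
synthetic = oversample_100(X[y == 1], rng)
X_all = np.vstack([X, synthetic])
y_all = np.concatenate([y, np.ones(len(synthetic), dtype=int)])
X_tr_2, X_val_2, y_tr_2, y_val_2 = shuffle_split(X_all, y_all, 0.2, rng)

print(X_tr_1.shape, X_val.shape)     # validation rows are all original rows
print(X_tr_2.shape, X_val_2.shape)   # validation rows may include synthetic rows
```

The only structural difference is where the synthetic rows can end up: in the 1st Approach they are confined to the training part, while in the 2nd Approach they are shuffled across both parts.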

In many web sources such as Stack Overflow and in many Personal Blogs, the 2nd Approach has been described as a **wrong method of Over-sampling**. In particular, **Nick Becker’s Personal Blog** [1] calls the 2nd Approach wrong, giving the following reason:

“*Application of SMOTe on the whole dataset creates similar instances as the algorithm is based on k-nearest neighbour theory. Due to this reason, Splitting after applying SMOTe on the given dataset, results in information leakage from the Validation Set to the Training Set, thus resulting in the classifier or the Machine Learning Model to over-estimate its accuracy and other performance measures*”

He also supports this with a practical example on a real dataset, using the imbalanced-learn toolbox [2] to apply SMOTe. Truly speaking, I have never been able to follow the documentation of the toolbox properly, so I prefer implementing the SMOTe Algorithm from scratch, as demonstrated above. In this article, I am going to demonstrate that the 2nd Approach is **NOT wrong**!!!

Let’s first follow the 1st Approach, as it is the widely accepted one.

In order to demonstrate that 2nd Approach is not wrong, I will be Shuffle-Splitting the whole dataset into **Train-Validation** and **Test** Sets. The Test Set will be kept separate as the unknown set of instances. The implementation of the same follows —

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# X_train and y_train form the Train-Validation Set
# X_test and y_test form the Test Set, kept separate

- Now, on the **Train-Validation** Set, the 1st and 2nd Approaches will be applied case-wise.
- Then, Performance Analysis will be done on the same separated set of unknown instances (**Test Set**) for both models (developed following the 1st Approach and the 2nd Approach).

Following 1st Approach of using SMOTe after Splitting

=> Splitting the **Train-Validation** Set into **Training** and **Validation** Sets. The implementation follows:

X_train, X_v, y_train, y_v = train_test_split(X_train, y_train, test_size=0.2, random_state=2341)

# X_train and y_train form the Training Set
# X_v and y_v form the Validation Set
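Since the data are split twice with test_size=0.2, it may help to work out what share of the original dataset each part receives. The helper name `nested_split_fractions` is hypothetical, and the result ignores the rounding to whole instances that happens in practice.

```python
def nested_split_fractions(test_fraction, validation_fraction):
    """Fractions of the full dataset ending up in the Training, Validation
    and Test Sets after two successive splits."""
    train_validation = 1.0 - test_fraction
    validation = train_validation * validation_fraction
    training = train_validation - validation
    return training, validation, test_fraction

training, validation, test = nested_split_fractions(0.2, 0.2)
print(round(training, 2), round(validation, 2), round(test, 2))
```

So roughly 64% of the instances are used for training, 16% for validation and 20% for testing.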

=> Applying SMOTe only on the Training Set

# 1. Getting the number of Minority Class Instances in the Training Set
import numpy as np  # required library

unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.ones((minority_shape, X_train.shape[1]))
k = 0
for i in range(0, X_train.shape[0]):
    if y_train[i] == 1.0:
        x1[k] = X_train[i]  # rows come from X_train, the array that y_train labels
        k = k + 1

# 3. Applying 100% SMOTe
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train, sampled_instances), axis=0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train, y_sampled_instances), axis=0)

# X_f and y_f are the Training Set Features and Labels respectively

Model Training using Gradient Boosting Classifier

Gradient Boosting Classifier is used for Training the Machine Learning Model. Grid-Search is used on the Gradient Boosting Classifier to find the best values of two hyper-parameters: the number of estimators (n_estimators) and max_depth.
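Before running the search, it can help to see how many hyper-parameter combinations it has to try. The sketch below enumerates them with the standard library only, using the same grid of values as the search in this article. A grid search fits each of these candidates on every cross-validation fold, so the count translates directly into training time.

```python
from itertools import product

parameters = {
    'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500],
    'max_depth': [3, 4, 5],
}

# Every combination of one value per hyper-parameter
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates))
print(candidates[0], candidates[-1])
```

Nine values of n_estimators times three values of max_depth give 27 candidates.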

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500],
              'max_depth': [3, 4, 5]}

clf = GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid=parameters, estimator=clf, verbose=3)
grid_search_1 = grid_search.fit(X_f, y_f)

So, the **Trained Machine Learning Model following the 1st Approach** is embedded in grid_search_1.

Following 2nd Approach of using SMOTe before Splitting

=> Applying SMOTe on the whole **Train-Validation** Set:

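# X_train and y_train must be the whole Train-Validation Set here. If the
# 1st Approach was run in the same session, they were overwritten, so the
# first split is repeated with the same random_state.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# 1. Getting the number of Minority Class Instances in the Train-Validation Set
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.ones((minority_shape, X_train.shape[1]))
k = 0
for i in range(0, X_train.shape[0]):
    if y_train[i] == 1.0:
        x1[k] = X_train[i]  # rows come from X_train, the array that y_train labels
        k = k + 1

# 3. Applying 100% SMOTe
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train, sampled_instances), axis=0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train, y_sampled_instances), axis=0)

# X_f and y_f are the Train-Validation Set Features and Labels respectively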

=> Splitting the **Train-Validation** Set into **Training** and **Validation** Sets. The implementation follows:

X_train, X_v, y_train, y_v = train_test_split(X_f, y_f, test_size=0.2, random_state=9999)

# X_train and y_train form the Training Set
# X_v and y_v form the Validation Set

Model Training using Gradient Boosting Classifier

Similarly, Grid Search is applied on Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500],
              'max_depth': [3, 4, 5]}

clf = GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid=parameters, estimator=clf, verbose=3)
grid_search_2 = grid_search.fit(X_train, y_train)

So, the **Trained Machine Learning Model following the 2nd Approach** is embedded in grid_search_2.

PERFORMANCE ANALYSIS AND COMPARISON

The Performance Metrics used for Comparison and Analysis are:

- Accuracy **on the Test Set (Test Accuracy)**
- Precision **on the Test Set**
- Recall **on the Test Set**
- F1-Score **on the Test Set**

In addition to these comparison metrics, Training Accuracy (on the Training Set) and Validation Accuracy (on the Validation Set) are also computed.
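As a reminder of how Precision, Recall and F1-Score relate to each other, here is a small sketch that computes them from confusion-matrix counts for the positive class. The function name and the counts are hypothetical and are not results from this article.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-Score from true positives, false positives
    and false negatives of one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts, not results from this article.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(precision, recall, round(f1, 4))
```

Precision is the share of predicted positives that are correct, Recall is the share of actual positives that are found, and F1-Score is their harmonic mean.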

from sklearn.metrics import classification_report

# MODEL 1 PERFORMANCE ANALYSIS
# Note: X_f, y_f, X_v and y_v are reused by the 2nd Approach, so steps 1 and 2
# must run before the 2nd Approach code overwrites them.

# 1. Training Accuracy for Model 1 (following Approach 1)
print(grid_search_1.score(X_f, y_f))

# 2. Validation Accuracy on Validation Set for Model 1
print(grid_search_1.score(X_v, y_v))

# 3. Test Accuracy on Test Set for Model 1
print(grid_search_1.score(X_test, y_test))

# 4. Precision, Recall and F1-Score on the Test Set for Model 1
predictions = grid_search_1.predict(X_test)
print(classification_report(y_test, predictions))

# MODEL 2 PERFORMANCE ANALYSIS

# 5. Training Accuracy for Model 2 (following Approach 2)
print(grid_search_2.score(X_train, y_train))

# 6. Validation Accuracy on Validation Set for Model 2
print(grid_search_2.score(X_v, y_v))

# 7. Test Accuracy on Test Set for Model 2
print(grid_search_2.score(X_test, y_test))

# 8. Precision, Recall and F1-Score on the Test Set for Model 2
predictions = grid_search_2.predict(X_test)
print(classification_report(y_test, predictions))

Training and Validation Set Accuracy for Model 1 and Model 2 are:

- Training Accuracy (Model 1): 90.64998262078554%
- Training Accuracy (Model 2): 90.92736479956705%
- Validation Accuracy (Model 1): 86.87140115163148%
- Validation Accuracy (Model 2): 89.33209647495362%

So, from these numbers the **2nd Approach** apparently shows higher Validation Accuracy, but no conclusion can be drawn without testing both models on the same, completely unknown Test Set. A comparison of the 2 models’ performances on the Test Set is shown in Table 2.

So, it is clearly shown that, however small the difference may be, **Approach 2 is more successful than Approach 1** and hence cannot be called wrong for the aforementioned reasons given by Nick Becker [1], because

*“Though SMOTe creates similar instances, on the other hand this property is required not only for Class Imbalance Reduction and Data Augmentation but also to find the best Training Set suitable for Model Training. If the Training Set is not versatile, how can Model Performance be enhanced? As far as the bleeding of information from Validation to Training Set is concerned, even if it occurs, it contributes to making the Training Set even better and helps in Robust Machine Learning Model Development as it is proved that for completely unknown instances, **Approach 2** performed better than **Approach 1**”*

REFERENCES

[1] https://beckernick.github.io/oversampling-modeling/

[2] https://imbalanced-learn.readthedocs.io/en/stable/

[3] Chawla, Nitesh V., et al. “SMOTE: synthetic minority over-sampling technique.” *Journal of Artificial Intelligence Research* 16 (2002): 321–357.

For personal contacts regarding the article or discussions on Machine Learning/Data Mining or any department of Data Science, feel free to reach out to me on **LinkedIn**