In Data Science, imbalanced datasets are no surprise. If a dataset intended for a classification problem, such as Sentiment Analysis, Medical Imaging, or another problem in Discrete Predictive Analytics (for example, Flight Delay Prediction), has an unequal number of instances (samples or data points) for the different classes, then that dataset is said to be imbalanced. The class with comparatively fewer instances is known as the minority class with respect to the class with a comparatively larger number of samples (known as the majority class). An example of an imbalanced dataset is given below:
Here, there are 2 classes, with labels 0 and 1, in Gross Imbalance.
Training a Machine Learning Model with such an imbalanced dataset often causes the model to develop a bias towards the majority class.
To tackle the issue of class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was introduced by Chawla et al. [3] in 2002.
A brief description of SMOTE
In SMOTE, synthetic instances of the minority class are generated by interpolation: for each minority-class instance, one of its k nearest minority-class neighbours is selected, the difference vector between the two is multiplied by a random number between 0 and 1, and the result is added to the original instance to create a new synthetic sample [3]. With 100% over-sampling, one synthetic instance is generated per minority instance, doubling the size of the minority class.
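In vector form, the core interpolation step looks like the following minimal sketch (a toy illustration with made-up points x and x_nn, not the full algorithm):

import numpy as np
rng = np.random.default_rng(0)
# a toy minority instance and one of its nearest minority neighbours
x = np.array([1.0, 2.0])
x_nn = np.array([2.0, 3.5])
w = rng.random()              # random weight in [0, 1)
x_new = x + w * (x_nn - x)    # synthetic point on the segment joining x and x_nn
print(x_new)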
An implementation of SMOTE in Python from scratch follows below —
import numpy as np
import random
from math import sqrt

def nearest_neighbour(X, x):
    # Euclidean distances from x to every other instance in X
    euclidean = np.ones(X.shape[0] - 1)
    k = 0
    for j in range(0, X.shape[0]):
        if not np.array_equal(X[j], x):
            euclidean[k] = sqrt(sum((X[j] - x) ** 2))
            k = k + 1
    euclidean = np.sort(euclidean)
    # random weight in (0, 1)
    weight = random.random()
    while weight == 0:
        weight = random.random()
    # scale the distance to the nearest neighbour by the random weight
    additive = np.multiply(euclidean[:1], weight)
    return additive
def SMOTE_100(X):
    # one synthetic sample per original instance (100% over-sampling)
    new = np.zeros((X.shape[0], X.shape[1]))
    k = 0
    for i in range(0, X.shape[0]):
        additive = nearest_neighbour(X, X[i])
        for j in range(0, 1):
            new[k] = X[i] + additive[j]
            k = k + 1
    return new  # the synthetic samples created by SMOTE
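As a quick usage sketch (with a made-up 3-instance, 2-feature minority matrix), the function can be exercised as follows:

x_minority = np.array([[1.0, 2.0],
                       [1.5, 1.8],
                       [2.0, 2.2]])   # toy minority-class instances
synthetic = SMOTE_100(x_minority)
print(synthetic.shape)               # (3, 2): one synthetic sample per instance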
Let us consider the Adult Census Income Prediction Dataset from UCI containing 48,842 instances and 14 attributes/features.
Data-preprocessing with Python Implementation:
Table 1
Implementing One-Hot Encoding in Python after Feature Selection ….
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Label Encoding and Feature Selection is over ....

# 1. Loading the modified dataset after Label Encoding
df = pd.read_csv('adult.csv')

# Loading of Selected Features into X
X = df.iloc[:, [0,1,2,3,4,5,6,7,9,10,11,12]].values

# Loading of the Label into y
y = df.iloc[:, 14].values

# 2. One Hot Encoding ....
onehotencoder = OneHotEncoder(categorical_features = [1,3,5,6,7])
X = onehotencoder.fit_transform(X).toarray()
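Note that the categorical_features argument has since been removed from OneHotEncoder in newer scikit-learn releases; on such versions, an equivalent sketch uses ColumnTransformer (same column indices as above assumed):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical columns, pass the numeric columns through
ct = ColumnTransformer([('onehot', OneHotEncoder(), [1, 3, 5, 6, 7])],
                       remainder = 'passthrough')
X = ct.fit_transform(X)
# X may come back as a SciPy sparse matrix; convert with X.toarray() if needed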
Now, the class label in this problem is binary, i.e., it takes only 2 values, so this is a Binary Classification Problem.
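A quick sanity check (using the y loaded during pre-processing) confirms this:

# numpy is already imported as np above
print(np.unique(y))   # expected output: the 2 class labels, 0 and 1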
Class Distribution Visualization
# Getting the no. of instances with Label 0
n_class_0 = df[df['income']==0].shape[0]

# Getting the no. of instances with Label 1
n_class_1 = df[df['income']==1].shape[0]

# Bar Visualization of Class Distribution
import matplotlib.pyplot as plt # required library

# (plotting variables renamed so that the label vector y is not overwritten)
x_bar = ['0', '1']
y_bar = np.array([n_class_0, n_class_1])
plt.bar(x_bar, y_bar)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
plt.show()
Class Distribution in the Adult Dataset
So, in the given dataset, there is Gross Imbalance between the 2 classes, with Class Label '1' as the Minority and Class Label '0' as the Majority.
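The degree of imbalance can also be quantified numerically (a small optional check using the counts computed above):

imbalance_ratio = n_class_0 / n_class_1
print('Majority-to-minority ratio: %.2f : 1' % imbalance_ratio)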
Now, there are 2 possible approaches:

1. Splitting the dataset first and then applying SMOTE only on the Training Set (SMOTE after splitting)
2. Applying SMOTE on the whole dataset first and then splitting it into Training and Validation Sets (SMOTE before splitting)
In many web sources, such as Stack Overflow, and in many personal blogs, the 2nd Approach has been described as a wrong method of over-sampling. In particular, in Nick Becker's personal blog [1], the 2nd Approach is stated to be wrong for the following reason:
“Application of SMOTE on the whole dataset creates similar instances, as the algorithm is based on the k-nearest-neighbours principle. Due to this, splitting after applying SMOTE on the given dataset results in information leakage from the Validation Set into the Training Set, causing the classifier or Machine Learning Model to over-estimate its accuracy and other performance measures”
He has also demonstrated this with a practical, real-life example on a dataset, using the imbalanced-learn toolbox [2] to apply SMOTE. Truthfully, I have never been able to properly figure out the toolbox's documentation, so I prefer implementing the SMOTE algorithm from scratch, as demonstrated above. In this article, I am going to demonstrate that the 2nd Approach is NOT wrong!
Let's first follow the 1st Approach, as it is widely accepted.

In order to demonstrate that the 2nd Approach is not wrong, I will shuffle-split the whole dataset into a Train-Validation Set and a Test Set. The Test Set will be kept separate as the set of completely unknown instances. The implementation follows —
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
# X_train and y_train form the Train-Validation Set
# X_test and y_test form the Test Set, separated out
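As an aside, for imbalanced data one may also pass stratify=y so that both parts preserve the original class ratio (a variant, not used in this experiment):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234, stratify=y)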
Following the 1st Approach: using SMOTE after Splitting
=> Splitting the Train-Validation Set into Training and Validation Sets. The implementation follows —
X_train, X_v, y_train, y_v = train_test_split(X_train, y_train, test_size=0.2, random_state=2341)
# X_train and y_train form the Training Set
# X_v and y_v form the Validation Set
=> Applying SMOTE only on the Training Set
# 1. Getting the number of Minority Class Instances in the Training Set
import numpy as np # required library
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.ones((minority_shape, X_train.shape[1]))
k = 0
for i in range(0, X_train.shape[0]):
    if y_train[i] == 1.0:
        x1[k] = X_train[i]   # note: index into X_train, not the full X
        k = k + 1

# 3. Applying 100% SMOTE
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train, sampled_instances), axis = 0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train, y_sampled_instances), axis = 0)
# X_f and y_f are the Training Set Features and Labels respectively
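As an optional sanity check (not part of the original walkthrough), the class distribution after over-sampling can be verified:

unique_f, counts_f = np.unique(y_f, return_counts=True)
print(dict(zip(unique_f, counts_f)))   # the minority count should now be doubled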
Model Training using Gradient Boosting Classifier
A Gradient Boosting Classifier is used for training the Machine Learning Model. Grid Search is used on the Gradient Boosting Classifier to obtain the best set of hyper-parameters: the number of estimators (n_estimators) and max_depth.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [100,150,200,250,300,350,400,450,500],
              'max_depth': [3,4,5]}
clf = GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid = parameters, estimator = clf, verbose = 3)

grid_search_1 = grid_search.fit(X_f, y_f)
So, the Trained Machine Learning Model following the 1st Approach is embedded in grid_search_1.
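If desired, the winning hyper-parameters and the corresponding cross-validated score can be inspected on the fitted grid_search_1 object:

print(grid_search_1.best_params_)   # e.g. a dict with 'max_depth' and 'n_estimators'
print(grid_search_1.best_score_)    # mean cross-validated score of the best setting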
Following the 2nd Approach: using SMOTE before Splitting
=> Applying SMOTE on the whole Train-Validation Set:
# 1. Getting the number of Minority Class Instances in the Train-Validation Set
# (re-using the Train-Validation Set X_train, y_train from the very first split)
unique, counts = np.unique(y_train, return_counts=True)
minority_shape = dict(zip(unique, counts))[1]

# 2. Storing the minority class instances separately
x1 = np.ones((minority_shape, X_train.shape[1]))
k = 0
for i in range(0, X_train.shape[0]):
    if y_train[i] == 1.0:
        x1[k] = X_train[i]   # note: index into X_train, not the full X
        k = k + 1

# 3. Applying 100% SMOTE
sampled_instances = SMOTE_100(x1)

# Keeping the artificial instances and original instances together
X_f = np.concatenate((X_train, sampled_instances), axis = 0)
y_sampled_instances = np.ones(minority_shape)
y_f = np.concatenate((y_train, y_sampled_instances), axis = 0)
# X_f and y_f are the Train-Validation Set Features and Labels respectively
=> Splitting the over-sampled Train-Validation Set into Training and Validation Sets. The implementation follows —
X_train, X_v, y_train, y_v = train_test_split(X_f, y_f, test_size=0.2, random_state=9999)
# X_train and y_train form the Training Set
# X_v and y_v form the Validation Set
Model Training using Gradient Boosting Classifier
Similarly, Grid Search is applied on the Gradient Boosting Classifier.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [100,150,200,250,300,350,400,450,500],
              'max_depth': [3,4,5]}
clf = GradientBoostingClassifier()
grid_search = GridSearchCV(param_grid = parameters, estimator = clf, verbose = 3)

grid_search_2 = grid_search.fit(X_train, y_train)
So, the Trained Machine Learning Model following the 2nd Approach is embedded in grid_search_2.
PERFORMANCE ANALYSIS AND COMPARISON
The Performance Metrics used for Comparison and Analysis are:

1. Accuracy on the Test Set (Test Accuracy)
2. Precision on the Test Set
3. Recall on the Test Set
4. F1-Score on the Test Set
Other than these comparison metrics, the Training Accuracy (on the Training Set) and the Validation Accuracy (on the Validation Set) are also computed.
# MODEL 1 PERFORMANCE ANALYSIS

# 1. Training Accuracy for Model 1 (following Approach 1)
print(grid_search_1.score(X_f, y_f))

# 2. Validation Accuracy on the Validation Set for Model 1
print(grid_search_1.score(X_v, y_v))

# 3. Test Accuracy on the Test Set for Model 1
print(grid_search_1.score(X_test, y_test))

# 4. Precision, Recall and F1-Score on the Test Set for Model 1
from sklearn.metrics import classification_report
predictions = grid_search_1.predict(X_test)
print(classification_report(y_test, predictions))
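The same Test-Set numbers can also be obtained metric-by-metric (a sketch using the predictions computed just above):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print(accuracy_score(y_test, predictions))
print(precision_score(y_test, predictions))
print(recall_score(y_test, predictions))
print(f1_score(y_test, predictions))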
# MODEL 2 PERFORMANCE ANALYSIS

# 5. Training Accuracy for Model 2 (following Approach 2)
print(grid_search_2.score(X_train, y_train))

# 6. Validation Accuracy on the Validation Set for Model 2
print(grid_search_2.score(X_v, y_v))

# 7. Test Accuracy on the Test Set for Model 2
print(grid_search_2.score(X_test, y_test))

# 8. Precision, Recall and F1-Score on the Test Set for Model 2
from sklearn.metrics import classification_report
predictions = grid_search_2.predict(X_test)
print(classification_report(y_test, predictions))
Training and Validation Set Accuracy for Model 1 and Model 2 are:
So, from here, the 2nd Approach apparently shows a higher Validation Accuracy, but no conclusion can be drawn without testing both models on the same, completely unknown Test Set. A comparison of the 2 models' performances on the Test Set is shown in Table 2.
Table 2
So, however small the difference may be, Approach 2 turns out to be more successful than Approach 1, and hence cannot be stated as wrong for the reasons given by Nick Becker [1], because:
“Though SMOTE creates similar instances, this property is required not only for Class Imbalance Reduction and Data Augmentation, but also for finding the Training Set best suited for Model Training. If the Training Set is not versatile, how can Model Performance be enhanced? As far as the leakage of information from the Validation Set into the Training Set is concerned, even if it occurs, it contributes to making the Training Set better and helps in developing a Robust Machine Learning Model, as shown by the fact that, on completely unknown instances, Approach 2 performed better than Approach 1”
REFERENCES
[1] https://beckernick.github.io/oversampling-modeling/
[2] https://imbalanced-learn.readthedocs.io/en/stable/
[3] Chawla, Nitesh V., et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321–357.
For personal contact regarding the article, or for discussions on Machine Learning/Data Mining or any area of Data Science, feel free to reach out to me on LinkedIn.