Imagine if you could get all the tips and tricks you need to tackle a binary classification problem on Kaggle or anywhere else. I have gone over 10 Kaggle competitions and pulled out that information for you:

- Toxic Comment Classification Challenge ($35,000)
- TalkingData AdTracking Fraud Detection Challenge ($25,000)
- IEEE-CIS Fraud Detection ($20,000)
- Jigsaw Multilingual Toxic Comment Classification ($50,000)
- RSNA Intracranial Hemorrhage Detection ($25,000)
- SIIM-ACR Pneumothorax Segmentation ($30,000)
- Jigsaw Unintended Bias in Toxicity Classification ($65,000)
- Santander Customer Transaction Prediction ($65,000)
- Microsoft Malware Prediction ($25,000)
- Humpback Whale Identification ($25,000)

Dive in.

Modeling

- Use two BiGRU layers feeding into two final Dense layers
- Decide on the best parameters by selecting the best out of 250 runs with Bayesian optimization
- Use a 2-level bidirectional GRU followed by max-pooling and 2 fully-connected layers

Dealing with imbalance problems

- Check this extensive notebook on handling imbalanced classes
- Class balancing of one-shots: get the top-1 frequencies of classes and replace the "new whale" class with classes that are one-shots and not present at top-1

Metrics

- Global AUC explained (+ alternatives)
- ROC AUC score for fraud detection
- Good Old Accuracy
- Mean Average Precision (MAP)
- Binary Log Loss
- F-beta score, where beta is equal to 0.5

Loss

BCE and Dice Based
- BCE with Logit Loss
- DICE loss for segmentation
- BCE + DICE
- Combinations of BCE, Dice and Focal loss

Focal Loss Based
- Focal Loss Function (see the focal loss sketch further below)
- BCE + focal loss
- Weighted Focal Loss

Custom Losses
- Weighted subgroup negative samples

Others
- Lovasz Loss
- Weighted sigmoid cross-entropy loss
- Hard triplet loss
- Center Loss
- Additive Angular Margin Loss
- Margin loss
- CosFace Loss
- KL-Div loss with soft label

Cross-validation + proper evaluation

- Use adversarial validation
- Apply GroupKFold cross-validation
- Simple time-split, using about the last 100k records as a validation set
- Generate predictions using unshuffled KFold
- Use stratified 5-fold without early stopping for predicting test data
- Implement LightGBM on 10 KFolds with no shuffle
- If using pseudo labeling, don't validate on the pseudo labels to avoid overfitting
- Use the standard 10-fold for the final blend
- Stratified cross-validation with multiple seeds (see the cross-validation sketch further below)

Post-processing

- Use the history of submissions to tweak the test set predictions
- Select a random 30% of CV, optimize the thresholds for that 30%, apply them to the remaining 70%, and check how far off they are from the optimal thresholds of the 70% (see the threshold-search sketch further below)
- Use a re-scaling factor for predictions >0.8 as well as <0.01 through the use of probabilistic random noise that introduces a small penalty
- Scale up the predicted probability of comments that contain curse words in different languages
- Label the test samples, add them to the train set, and train to convergence using the best-performing ensemble

Ensembling

Averaging
- Averaging over multiple seeds
- Average 10 out-of-fold predictions
- Average multiple seeds
- Add model diversity by seed averaging and bagging models with different folds

Geometric mean
- A weighted geometric mean ensemble of LightGBM and CatBoost
- A geometric mean average of different models
- An average ensemble of XLM-R models
- Average predictions from 7 language-specific models
- An ensemble of XLM-R models
- An ensemble of CatBoost, XGBoost, and LightGBM

Stacking
- Stack Bi-LSTM, Bert-Large-Uncased with WWM, and XLNET, with ExtraTreesClassifier as the meta-model (see the stacking sketch further below)
- LightGBM stacking
- Stack LightGBM with heavy Bayesian optimization
- Stack models using PyStackNet and MlXtend
- An ensemble of RNN, CNN, LightGBM, and NBSVM
- Use 5-time bagged XGB CV scores with heavy Bayesian optimization

Blending
- Use power blending
- Blend using Hyperopt and OOF predictions to find optimal weights

Others
- Implement hill climbing
- Apply LGB bagged 10 times with different training data samples
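To make a few of the tips above more concrete, the next sketches show one possible Python take on them; none of this is code from the winning solutions. First, a minimal binary focal loss of the kind listed in the Loss section. The alpha and gamma defaults are common choices, not values taken from any particular competition:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss computed from raw logits.

    Down-weights easy examples so training focuses on hard, misclassified
    ones; alpha balances positive vs. negative classes. `targets` is
    expected to be a float tensor of 0s and 1s, same shape as `logits`.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    p_t = targets * probs + (1 - targets) * (1 - probs)      # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

Adding this term to plain BCE gives the "BCE + focal loss" combination from the same list.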
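Next, stratified cross-validation with multiple seeds plus seed averaging of the out-of-fold predictions, covering the Cross-validation and Averaging tips. LogisticRegression is only a placeholder so the sketch runs end to end; in the competitions above the base model would typically be LightGBM or a neural network, and X, y are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def seed_averaged_oof(X, y, seeds=(0, 1, 2), n_splits=5):
    """Stratified K-fold repeated over several seeds; the out-of-fold
    predictions are averaged across seeds for a more stable estimate."""
    oof = np.zeros((len(seeds), len(y)))
    for s, seed in enumerate(seeds):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, valid_idx in skf.split(X, y):
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx], y[train_idx])
            oof[s, valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    oof_mean = oof.mean(axis=0)
    print("seed-averaged OOF AUC:", roc_auc_score(y, oof_mean))
    return oof_mean
```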
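The post-processing tip about optimizing thresholds on a random 30% of the CV predictions needs a threshold search. Here is a simple grid search that maximizes the F-beta score (beta = 0.5) from the Metrics section; splitting the data 30/70 and comparing the resulting thresholds is left out for brevity:

```python
import numpy as np
from sklearn.metrics import fbeta_score

def best_threshold(y_true, y_prob, beta=0.5):
    """Grid-search the decision threshold that maximizes F-beta.
    With beta=0.5, precision is weighted more heavily than recall."""
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = [fbeta_score(y_true, (y_prob >= t).astype(int), beta=beta)
              for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]
```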
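Finally, a sketch of the stacking tip where out-of-fold predictions from several base models feed an ExtraTreesClassifier meta-model. The base models themselves (Bi-LSTM, BERT, XLNet and so on) are out of scope; their OOF and test probabilities are assumed to be precomputed arrays:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def stack_with_extratrees(oof_preds, y, test_preds):
    """Level-2 stacking on top of base-model predictions.

    oof_preds:  (n_train, n_models) out-of-fold probabilities of the base models
    test_preds: (n_test,  n_models) base-model probabilities on the test set
    Returns the meta-model's own OOF predictions (for honest scoring)
    and its test-set predictions (for submission).
    """
    meta = ExtraTreesClassifier(n_estimators=500, random_state=0, n_jobs=-1)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    meta_oof = cross_val_predict(meta, oof_preds, y, cv=cv,
                                 method="predict_proba")[:, 1]
    meta.fit(oof_preds, y)
    return meta_oof, meta.predict_proba(test_preds)[:, 1]
```

The returned meta_oof lets you score the stack with the same metric used for the base models before submitting.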
Repositories and open solutions

Repos with open source solutions:

Image based solutions
- Humpback Whale Identification 1st Place Code
- Data Science Bowl 2nd Place Solution
- Forecasting Lung Cancer Diagnoses with Deep Learning (Kaggle Data Science Bowl 2017)
- RSNA Intracranial Hemorrhage Detection 1st Place Solution
- RSNA Intracranial Hemorrhage Detection 2nd Place Solution
- RSNA Intracranial Hemorrhage Detection 3rd Place Solution
- RSNA Intracranial Hemorrhage Detection 4th Place Solution with code
- RSNA Intracranial Hemorrhage Detection 5th Place Solution
- Entrypoint for the RSNA Intracranial Hemorrhage Detection 5th-place solution
- SIIM-ACR Pneumothorax Segmentation 1st Place Solution
- SIIM-ACR Pneumothorax Segmentation 3rd Place Solution
- SIIM-ACR Pneumothorax Segmentation 5th Place Solution
- Humpback Whale Identification 5th Place Solution
- Humpback Whale Identification 4th Place Solution
- Kaggle Humpback Whale Identification Challenge 2019 2nd Place Code

Tabular based solutions
- How to implement LibFM in Keras and how it was used in the TalkingData competition on Kaggle
- XGB Fraud Detection Solution
- Fraud Detection Feature Engineering
- Santander Customer Transaction Prediction 2nd Place Solution
- Santander Customer Transaction Prediction 5th Place Solution
- Solution to the Kaggle Santander Customer Transaction Prediction competition
- 2nd Place Solution to the Microsoft Malware Prediction Challenge on Kaggle

Text classification based solutions
- 12th place solution for the Toxic Comment Classification Challenge
- Code and write-up for the Kaggle Toxic Comment Classification Challenge
- Jigsaw Unintended Bias in Toxicity Classification 4th Place Solution
- An open solution to the Toxic Comment Classification Challenge
- TalkingData AdTracking Fraud Detection Challenge 4th Place Solution
- Bronze medal Jigsaw solution
- 2nd place solution for the 2017 National Data Science Bowl
- Jigsaw Unintended Bias in Toxicity Classification 10th Place Solution
- Code for the 3rd place solution in the Kaggle Humpback Whale Identification Challenge

Final thoughts

Hopefully, this article gave you some background into binary classification tips and tricks, as well as some tools and frameworks that you can use to start competing.

We've covered tips on architectures, losses, post-processing, ensembling, tools, and frameworks. If you want to go deeper, simply follow the links and see how the best binary classification models are built.

See also:
- Image Segmentation: Tips and Tricks from 39 Kaggle Competitions
- Text Classification: All Tips and Tricks from 5 Kaggle Competitions
- Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions

This article was originally written by Derrick Mwiti and posted on the Neptune blog. You can find more in-depth articles for machine learning practitioners there.