Text Classification Models: All Tips And Tricks From 5 Kaggle Competitions

by neptune.ai Jakub Czakon, May 30th, 2020

In this article (originally posted by Shahul ES on the Neptune blog), I will discuss some great tips and tricks to improve the performance of your text classification model. These tricks are drawn from the solutions to some of Kaggle’s top NLP competitions.

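Namely, I’ve gone through:

  • Jigsaw Unintended Bias in Toxicity Classification
  • Quora Insincere Questions Classification
  • Google QUEST Q&A Labeling
  • TensorFlow 2.0 Question Answering

and found a ton of great ideas.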

Without further delay, let’s begin.

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more is already a lot for Kaggle kernels and basic laptops), you could find it difficult to load and process with limited resources. Here are links to some of the articles and kernels that I have found useful in such situations.
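One general way to cope with a file that does not fit in memory is to stream it in batches instead of loading it all at once. Below is a minimal sketch using only the Python standard library. The column names and the in-memory sample are made up for illustration, and `iter_batches` is a hypothetical helper, not something taken from the competition solutions.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of at most `batch_size` row dicts from a CSV line source."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# A tiny in-memory stand-in for a multi-gigabyte file (made-up data).
# With a real file you would pass open(path, newline="") instead.
sample = io.StringIO(
    "id,comment_text,target\n"
    "1,great answer,0\n"
    "2,you are awful,1\n"
    "3,thanks a lot,0\n"
    "4,what a stupid idea,1\n"
    "5,see the docs,0\n"
)

rows_seen = 0
positives = 0
for batch in iter_batches(sample, batch_size=2):
    # Only one batch is held in memory at a time.
    rows_seen += len(batch)
    positives += sum(int(row["target"]) for row in batch)

print(rows_seen, positives)
```

The same pattern works for any per-row statistic or preprocessing step that can be computed incrementally.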

Small datasets and external data

But what can you do if the dataset is small? Let’s see some techniques to tackle this situation.

One way to increase the performance of any machine learning model is to use an external dataset that contains variables that influence the target variable.

Let’s see some of the external datasets.

Data Exploration and Gaining insights

Data exploration helps you understand the data better and gain insights from it. Before starting to develop machine learning models, top competitors read and do a lot of exploratory data analysis. This helps with feature engineering and with cleaning the data.
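As a small illustration of what a first exploratory pass can look like, the sketch below computes the class balance and the average text length per class. The sample texts and labels are invented, and `summarize` is a hypothetical helper. Real exploratory analysis usually goes much further than this.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize(samples):
    """Return the class share and mean token count for each label."""
    label_counts = Counter(label for _, label in samples)
    lengths = defaultdict(list)
    for text, label in samples:
        lengths[label].append(len(text.split()))
    return {
        label: {
            "share": label_counts[label] / len(samples),
            "mean_tokens": mean(lengths[label]),
        }
        for label in label_counts
    }


# Made-up (text, label) pairs; label 1 marks the minority class.
samples = [
    ("how do i learn python", 0),
    ("why are people so dumb", 1),
    ("best books on statistics", 0),
    ("what is a good laptop for ml", 0),
]

summary = summarize(samples)
for label, stats in summary.items():
    print(label, stats)
```

A strongly skewed class share or a large gap in text length between classes is the kind of finding that later feeds into feature engineering and the choice of validation strategy.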

Data Cleaning

Data cleaning is an integral part of any NLP problem. Text data almost always needs some preprocessing and cleaning before we can represent it in a suitable form.
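A typical cleaning function might look roughly like the sketch below. The specific steps (unescaping HTML, dropping tags and URLs, expanding a few contractions, squeezing repeated characters) and the tiny contraction table are my own illustrative choices, not a recipe from any particular solution. Which steps actually help depends on the data and on the model.

```python
import html
import re

# Illustrative, deliberately tiny contraction table (an assumption).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}

CONTRACTION_RE = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
REPEAT_RE = re.compile(r"(.)\1{2,}")   # three or more of the same character
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Normalize raw text into lowercase, space-separated word tokens."""
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)               # drop HTML tags
    text = URL_RE.sub(" ", text)               # drop URLs
    text = text.lower().replace("’", "'")      # unify apostrophes
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = REPEAT_RE.sub(r"\1\1", text)        # "sooooo" -> "soo"
    text = NON_WORD_RE.sub(" ", text)          # strip punctuation and symbols
    return SPACE_RE.sub(" ", text).strip()


raw = "<b>I can't</b> believe it!!! See https://example.com sooooo good &amp; fun"
print(clean_text(raw))
```

Note that aggressive cleaning like this suits classic embeddings and bag-of-words features better than pretrained transformer models, which generally expect text close to its raw form.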

Text Representations

Before we feed our text data to a neural network or other ML model, the text needs to be represented in a suitable numeric format. The choice of representation determines the performance of the model to a large extent.
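The simplest representation that a neural network can consume is a fixed-length sequence of integer token ids, which an embedding layer then maps to vectors. Here is a minimal sketch of that step, assuming whitespace tokenization. The reserved ids and the helper names are hypothetical choices for this example.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens


def build_vocab(texts, min_count=1):
    """Map each sufficiently frequent token to an integer id (2, 3, ...)."""
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = {}
    for tok, count in counts.most_common():
        if count >= min_count:
            vocab[tok] = len(vocab) + 2
    return vocab


def encode(text, vocab, max_len):
    """Turn text into exactly `max_len` ids, truncating or padding as needed."""
    ids = [vocab.get(tok, UNK) for tok in text.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


train_texts = ["good question", "bad question"]
vocab = build_vocab(train_texts)
encoded = encode("good new question", vocab, max_len=5)
print(vocab, encoded)
```

Pretrained word vectors or the contextual models listed next replace the randomly initialized embedding lookup, but the text still has to be tokenized and converted to ids first.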

Contextual embedding models

  • BERT: Bidirectional Encoder Representations from Transformers
  • GPT
  • RoBERTa: a Robustly Optimized BERT Pretraining Approach
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  • DistilBERT: a lighter version of BERT
  • XLNet

Modeling

Model architecture

Choosing the right architecture is important for developing a good machine learning model. Recurrent sequence models such as LSTMs and GRUs perform well on NLP problems and are always worth trying. Stacking 2 layers of LSTM/GRU networks is a common approach.

Loss functions

Choosing a proper loss function for your NN model can noticeably improve its performance, because the loss defines the surface that the optimizer has to navigate.

You can try different loss functions or even write a custom loss function that matches your problem. Some of the popular loss functions are
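To make the idea of a custom loss concrete, here is a minimal sketch of binary cross-entropy with an extra weight on the positive class, which is one common way to adapt a loss to an imbalanced problem. It is written in plain Python for readability. In practice you would express the same formula with your framework's tensor operations. The `pos_weight` parameter and its values are illustrative assumptions.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy; positive examples are scaled by `pos_weight`."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
probs = [0.5, 0.5]
plain = weighted_bce(labels, probs)                    # standard BCE
weighted = weighted_bce(labels, probs, pos_weight=2.0) # positives count double
print(plain, weighted)
```

With `pos_weight=1.0` this reduces to the standard binary cross-entropy, so the weight can be tuned like any other hyperparameter.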

Optimizers

Callback methods

Callbacks are useful for monitoring your model during training and for triggering actions, such as stopping early or saving a checkpoint, that can improve its final performance.
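Early stopping is probably the most familiar example of such a callback. The sketch below shows the core logic in a framework-independent way. The class, its parameters, and the validation losses are made up for illustration, and deep learning frameworks ship their own ready-made versions.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch.
val_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.49]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best)
```

In a real training loop you would also save the model weights whenever `best_epoch` is updated, so that you can restore the best checkpoint after stopping.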

Evaluation and cross-validation

Choosing a suitable validation strategy is very important to avoid huge leaderboard shake-ups or poor performance of the model on the private test set.

A single traditional 80:20 split is not reliable in many cases. Cross-validation usually gives a better estimate of model performance than a single train-validation split.

There are several variations of k-fold cross-validation, such as group k-fold, and you should choose the one that matches the structure of your data.
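Group k-fold matters whenever several rows belong together, for example multiple answers to the same question: if related rows land on both sides of a split, the validation score becomes too optimistic. The sketch below is a simplified, greedy version of the idea, with a hypothetical `group_k_fold` helper and made-up group labels. scikit-learn provides a maintained implementation as `GroupKFold`.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, valid_idx) pairs in which no group spans both sides.

    Assumes there are at least `n_splits` distinct groups.
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: -len(members[g])):
        min(folds, key=len).extend(members[group])

    for k in range(n_splits):
        valid = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, valid


# Made-up example: each row is tagged with the question it belongs to.
groups = ["q1", "q1", "q2", "q3", "q3", "q3", "q4"]
splits = list(group_k_fold(groups, n_splits=2))
for train, valid in splits:
    print("train:", train, "valid:", valid)
```

Every row appears in exactly one validation fold, and all rows of a given question always stay together.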

Runtime tricks

A few tricks can reduce training and inference time, and some of them also improve model performance.
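One trick of this kind, which I am adding here as an example rather than quoting from the text above, is sequence bucketing: instead of padding every text to the global maximum length, you sort the texts by length and pad each batch only to its own longest sequence. The helper name and the toy sequences below are assumptions for the sketch.

```python
def bucket_batches(token_ids, batch_size, pad_id=0):
    """Group sequences of similar length and pad each batch to its own max."""
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(token_ids[i]) for i in idx)
        padded = [token_ids[i] + [pad_id] * (width - len(token_ids[i])) for i in idx]
        yield idx, padded  # idx lets you restore the original order later


# Made-up token id sequences of very different lengths.
sequences = [[5, 6, 7, 8, 9, 10], [3], [4, 4], [1, 2, 3, 4, 5]]

batches = list(bucket_batches(sequences, batch_size=2))
bucketed_cells = sum(len(row) for _, padded in batches for row in padded)
global_cells = len(sequences) * max(len(s) for s in sequences)
print(bucketed_cells, global_cells)
```

In this toy case the model processes 16 token positions instead of 24. During training you would typically shuffle the batches, since feeding them strictly in length order can bias the optimization.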

Model ensembling

If you’re competing, you are unlikely to get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.

Let’s see some of the popular ensembling techniques used in Kaggle competitions:
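Two of the simplest techniques are weighted averaging and rank averaging of model predictions. The sketch below shows both in plain Python. The predictions and weights are invented, the helper names are hypothetical, and this is not meant as a complete list of what top solutions use.

```python
def weighted_average(predictions, weights):
    """Blend per-model prediction lists using the given model weights."""
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


def rank_average(predictions):
    """Average normalized ranks instead of raw scores.

    Helps when models output scores on different scales.
    Needs at least two samples; ties are broken by position.
    """
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        result = [0.0] * len(scores)
        for rank, i in enumerate(order):
            result[i] = rank / (len(scores) - 1)
        return result

    ranked = [ranks(preds) for preds in predictions]
    return weighted_average(ranked, [1.0] * len(ranked))


# Made-up predicted probabilities from two models on three samples.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.5]

blend = weighted_average([model_a, model_b], weights=[3, 1])
rank_blend = rank_average([model_a, model_b])
print(blend, rank_blend)
```

The blending weights are best chosen on out-of-fold validation predictions rather than on the public leaderboard, which ties ensembling back to the validation strategy discussed earlier.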

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your NLP classification model. Hopefully, you will find them useful in your projects.

This article was originally posted on the Neptune blog. If you liked it, you may like it there :)

You can also find me tweeting @Neptune_ai or posting on LinkedIn about ML and Data Science stuff.