In this article (originally posted by Shahul ES on the Neptune blog), I will discuss some great tips and tricks to improve the performance of your text classification model. These tricks come from the solutions to some of Kaggle’s top NLP competitions.

Namely, I’ve gone through:

- Jigsaw Unintended Bias in Toxicity Classification ($65,000)
- Toxic Comment Classification Challenge ($35,000)
- Quora Insincere Questions Classification ($25,000)
- Google QUEST Q&A Labeling ($25,000)
- TensorFlow 2.0 Question Answering ($50,000)

and found a ton of great ideas. Without further delay, let’s begin.

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more for Kaggle kernels and basic laptops), you could find it difficult to load and process with limited resources. Here are some ideas from articles and kernels that I have found useful in such situations.

- Optimize memory by reducing the size of some attributes
- Use open-source libraries such as Dask to read and manipulate the data; it performs parallel computing and saves memory
- Use cudf
- Convert data to parquet format
- Convert data to feather format

Small datasets and external data

But what can one do if the dataset is small? Let’s see some techniques to tackle this situation.

One way to increase the performance of any machine learning model is to use external data containing variables that influence the predicted (target) variable. Let’s see some of the external datasets.

- Use of SQuAD data for Question Answering tasks
- Other datasets for QA tasks
- Wikitext long term dependency language modeling dataset
- Stackexchange data
- Prepare a dictionary of commonly misspelled words and corrected words.
- Use of helper datasets for cleaning
- Pseudo labeling is the process of adding confidently predicted test data to your training data
- Use different data sampling methods
- Text augmentation by exchanging words with synonyms
- Text augmentation by noising in RNN
- Text augmentation by translation to other languages and back

Data Exploration and Gaining insights

Data exploration always helps to better understand the data and gain insights from it. Before starting to develop machine learning models, top competitors always read and do a lot of exploratory data analysis (EDA) on the data. This helps in feature engineering and cleaning of the data.

- Twitter data exploration methods
- Simple EDA for tweets
- EDA for Quora data
- EDA in R for Quora data
- Complete EDA with stack exchange data
- My previous article on EDA for natural language processing

Data Cleaning

Data cleaning is one of the important and integral parts of any NLP problem. Text data always needs some preprocessing and cleaning before we can represent it in a suitable form.

- Clean social media data with a dedicated notebook
- Data cleaning for BERT
- Use textblob to correct misspellings
- Cleaning for pre-trained embeddings
- Language detection and translation for multilingual tasks
- Preprocessing for Glove, part 1 and part 2
- Increasing word coverage to get more from pre-trained word embeddings

Text Representations

Before we feed our text data to a neural network or ML model, the text input needs to be represented in a suitable format. These representations determine the performance of the model to a large extent.

- Pretrained Glove vectors
- Pretrained fasttext vectors
- Pretrained word2vec vectors
- My previous article on these 3 embeddings
- Combining pre-trained vectors. This can help in better representation of text and decreasing OOV (out-of-vocabulary) words
- Paragram embeddings
- Universal Sentence Encoder
- Use USE to generate sentence-level features
- 3 methods to combine embeddings

Contextual embeddings models

- BERT: Bidirectional Encoder Representations from Transformers
- GPT
- Roberta: a Robustly Optimized BERT
- Albert: a Lite BERT for Self-supervised Learning of Language Representations
- Distilbert: a lighter version of BERT
- XLNET

Modeling

Model architecture

Choosing the right architecture is important for developing a proper machine learning model. Sequence models such as LSTMs and GRUs perform well on NLP problems and are always worth trying. Stacking 2 layers of LSTM/GRU networks is a common approach.

- Stacking Bidirectional CuDNNLSTM
- Stacking LSTM networks
- LSTM and 5 fold Attention
- Bidirectional LSTM with 1D convolutions
- Unfreeze and tune embeddings
- BiLSTM with Global maxpooling
- Attention weighted average
- GRU + Capsule network
- InceptionCNN with flip
- Plain vanilla network with BERT
- CuDNNGRU network
- TextCNN with pooling layers
- BERT embeddings with LSTM
- Multi-sample dropouts
- Siamese transformer network
- Global average pooling of BERT hidden layers
- Different Bert based models
- Distilling BERT: BERT performance using Logistic Regression
- Different learning rates among the layers of BERT
- Finetuning Bert for text classification

Loss functions

Choosing a proper loss function for your NN model can noticeably improve its performance, because the loss defines the surface the model optimizes over. You can try different loss functions or even write a custom loss function that matches your problem.
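To make the idea of a custom loss concrete, here is a minimal sketch of a binary focal loss next to plain binary cross-entropy, written in pure Python so it runs without any framework. The function names, the per-example (non-vectorized) form, and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration, not code taken from any of the competition solutions; in a real model you would express the same formula with your framework's tensor operations.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Plain binary cross-entropy for one example (y_true is 0 or 1)."""
    p = min(max(y_prob, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p_t is the probability the model assigns to the true class, so the
    (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    gamma and alpha here are illustrative defaults, not tuned values.
    """
    p = min(max(y_prob, eps), 1 - eps)
    p_t = p if y_true == 1 else 1 - p
    alpha_t = alpha if y_true == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


def mean_loss(loss_fn, labels, probs, **kwargs):
    """Average a per-example loss over a batch."""
    return sum(loss_fn(y, p, **kwargs) for y, p in zip(labels, probs)) / len(labels)


# Toy batch: two confident correct predictions and one badly wrong one.
labels = [1, 0, 1]
probs = [0.95, 0.05, 0.10]
bce = mean_loss(binary_cross_entropy, labels, probs)
focal = mean_loss(binary_focal_loss, labels, probs)
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half of the cross-entropy, which is a handy sanity check when you port the formula to a deep learning framework.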
Some of the popular loss functions are:

- Binary cross-entropy for binary classification
- Categorical cross-entropy for multi-class classification
- Focal loss used for unbalanced datasets
- Weighted focal loss for multilabel classification
- Weighted kappa for multiclass classification
- BCE with logit loss to get sigmoid cross-entropy
- Custom mimic loss used in the Jigsaw unintended bias classification competition
- MTL custom loss used in the Jigsaw unintended bias classification competition

Optimizers

- Stochastic gradient descent
- RMSprop
- Adagrad allows the learning rate to adapt based on parameters
- Adam for fast and easy convergence
- Adam with warmup to add a warmup stage to the Adam algorithm
- Bert Adam for Bert based models
- Rectified Adam for stabilizing training and accelerating convergence

Callback methods

Callbacks are always useful to monitor the performance of your model while training and to trigger actions that can improve it.

- Model checkpoint for monitoring and saving weights
- Learning rate scheduler to change the learning rate based on model performance to help converge easily
- Simple custom callbacks using lambda callbacks
- Custom Checkpointing
- Building your custom callbacks for various use cases
- Reduce on plateau to reduce the learning rate when a metric has stopped improving
- Early Stopping to stop training when the model stops improving
- Snapshot ensembling to get a variety of model checkpoints in one training
- Fast geometric ensembling
- Stochastic Weight Averaging (SWA)
- Dynamic learning rate decay

Evaluation and cross-validation

Choosing a suitable validation strategy is very important to avoid huge shake-ups or poor performance of the model on the private test set. The traditional 80:20 split doesn’t work in many cases. Cross-validation usually gives a better estimate of model performance than a single train-validation split.
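As a rough illustration of why cross-validation gives a steadier estimate than a single split, here is a minimal plain-Python sketch of k-fold splitting and scoring. The helper names (`k_fold_indices`, `cross_val_mean`), the majority-class "model", and the toy data are all hypothetical and only serve to show the mechanics; in practice you would more likely reach for the splitters in scikit-learn's `model_selection` module and add shuffling.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) so every sample is validated exactly once.

    No shuffling is done here; this sketch keeps the original order.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


def cross_val_mean(fit, score, X, y, n_splits=5):
    """Train on each fold's training part, score on its held-out part."""
    fold_scores = []
    for train_idx, valid_idx in k_fold_indices(len(X), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_scores.append(
            score(model, [X[i] for i in valid_idx], [y[i] for i in valid_idx])
        )
    return sum(fold_scores) / len(fold_scores), fold_scores


# Hypothetical stand-ins for a real model and metric.
def fit_majority(X_train, y_train):
    """'Model' that always predicts the most frequent training label."""
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_valid, y_valid):
    return sum(1 for label in y_valid if label == model) / len(y_valid)


texts = [f"doc {i}" for i in range(10)]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
mean_score, fold_scores = cross_val_mean(fit_majority, accuracy, texts, labels)
```

The per-fold scores differ from one another, which is exactly the variance a single 80:20 split would hide.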
There are different variations of KFold cross-validation, such as group k-fold, that should be chosen according to the structure of your data.

- K-fold cross-validation
- Stratified KFold cross-validation
- Group KFold
- Adversarial validation to check if train and test distributions are similar or not
- CV analysis of different strategies

Runtime tricks

You can use some tricks to decrease the runtime and also improve model performance at runtime.

- Sequence bucketing to save runtime and improve performance
- Take the head and tail of the input when the sentence is longer than 512 tokens
- Use the GPU efficiently
- Free keras memory
- Save and load models to save runtime and memory
- Don’t Save Embedding in RNN Solutions
- Load word2vec vectors without key vectors

Model ensembling

If you’re in a competitive environment, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.

Let’s see some of the popular ensembling techniques used in Kaggle competitions:

- Weighted average ensemble
- Stacked generalization ensemble
- Out of folds predictions
- Blending with linear regression
- Use optuna to determine blending weights
- Power average ensemble
- Power 3.5 blending strategy

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your NLP classification model. Hopefully, you will find them useful in your projects.

This article was originally posted on the Neptune blog. If you liked it, you may like it there :)

You can also find me tweeting @Neptune_ai or posting on LinkedIn about ML and Data Science stuff.