This blog post is the third one in the 5-minute Papers series. In this post, I’ll give highlights from the paper by Jason Wei et al., “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”.

As the name suggests, this paper uses 4 simple ideas to perform data augmentation on NLP datasets. The authors argue that their approach is much less computationally expensive than the alternatives and still yields a performance gain.

Context: Data Augmentation

When you’re training a Machine Learning model, your model learns features from your dataset. However, if your dataset is small relative to what the model needs in order to be “trained”, the model might start to memorize features that are very specific to that dataset. This is known as overfitting.

To avoid overfitting, you can collect more data, which can be pretty challenging. Or you can “augment” your current dataset to artificially add more data (free of cost). This is a well-practiced technique when you are working with image data. However, data augmentations for natural language are not as well developed as in the image domain.

Why? Image augmentations do not require your code to have domain-specific knowledge: it’s easy to crop your image or flip it. Creating more language data that is similar to your current dataset, while retaining the details of the original dataset, is a challenging task.

A few of the techniques that I’ll save for a future blog post are:

Translating sentences into another language and then back to the original.
Using predictive language models to do synonym replacement.

The authors argue that these are computationally expensive tasks and introduce 4 simple operations, claiming a boost in accuracy when using EDA on “small datasets”.

Approach

Synonym Replacement (SR): Randomly replace n words in the sentence with their synonyms.
Random Insertion (RI): Insert a random synonym of a word into the sentence; this is done n times.
Random Swap (RS): Randomly swap two words in the sentence; this is repeated n times.
Random Deletion (RD): Randomly remove each word in the sentence with probability p.

The number of words changed in a sentence is given by the formula: n = alpha * (length of the sentence). Alpha is the “augmentation parameter”: the higher the alpha, the more aggressive the EDA.

Results

These techniques are compared on small subsets of text classification datasets using a simple CNN-based and a simple RNN-based architecture (details of the architectures aren’t interesting for this writeup). The comparisons are made on model accuracy with and without EDA.

The EDA techniques are most helpful where the training set has 500 examples. The difference is less pronounced on bigger subsets, but a slight improvement is consistent.

Figure: Performance gain plotted against the number of augmentations applied, across different subsets.

Conclusion & Personal thoughts

The techniques introduced are easy to apply to any dataset, and the authors have made the source code available as well.

One might argue that EDA changes the original sentences. As a quick check, the authors draw a latent space visualization of original and augmented sentences in the Pro-Con dataset to argue that this is not the case.

I think it might be an interesting experiment to enable EDA only during the initial epochs of training, or to start training on a subset of the dataset with EDA enabled and progressively make it less aggressive while increasing the portion of the dataset that we’re training on.

Personally, I’m trying to run these experiments on sentiment classification. It’s a joint experiment by Rishi Bhalodia and myself. Here is the starter kernel to our approach. As with all of my experiments, it will be based on the fastai library, and you can expect a fully transparent study in future blog posts. If you’re interested in contributing, please join the discussion on fastai forums here.
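To make the four operations from the Approach section a bit more concrete, here is a minimal sketch in plain Python. It is not the authors' released code. The toy synonym table, the function names, and the choice to always make at least one change are assumptions made for this illustration; a real setup would need a proper synonym source such as WordNet.

```python
import random

# Hypothetical toy synonym table, for illustration only.
# A real pipeline would look synonyms up in a resource such as WordNet.
TOY_SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "film": ["movie", "picture"],
}

def num_changes(words, alpha):
    """n = alpha * sentence length (flooring at 1 is this sketch's choice)."""
    return max(1, int(alpha * len(words)))

def synonym_replacement(words, n, synonyms, rng):
    """SR: replace up to n words that have a known synonym."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

def random_insertion(words, n, synonyms, rng):
    """RI: n times, insert a synonym of a random word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in synonyms]
        if not candidates:
            break
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), new_word)
    return out

def random_swap(words, n, rng):
    """RS: n times, swap two randomly chosen positions."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """RD: drop each word with probability p, but never return an empty sentence."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # seeded so repeated runs give the same augmentations
words = "the quick film made me happy".split()
alpha = 0.2
n = num_changes(words, alpha)

print(" ".join(synonym_replacement(words, n, TOY_SYNONYMS, rng)))
print(" ".join(random_insertion(words, n, TOY_SYNONYMS, rng)))
print(" ".join(random_swap(words, n, rng)))
print(" ".join(random_deletion(words, alpha, rng)))
```

Each function returns a new list of words and leaves the input untouched, so you can apply several operations to the same sentence to produce multiple augmented copies. In this sketch the same alpha is reused as the deletion probability p, which is one simple way to tie all four operations to a single knob.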
If you found this interesting and would like to be a part of My Learning Path, you can find me on Twitter here. If you’re interested in reading about Deep Learning and Computer Vision news, you can check out my newsletter here. If you’d like to read some of the best advice from Machine Learning Heroes (practitioners, researchers, and Kagglers), please click here.
