There are many tasks in NLP, from text classification to question answering, but whatever you do, the amount of data you have to train your model heavily impacts model performance.

What can you do to make your dataset larger? The simple option: get more data. But acquiring and labeling additional observations can be an expensive and time-consuming process.

What can you do instead? Apply data augmentation to your text data. Data augmentation techniques are used to generate additional, synthetic data from the data you already have. Augmentation methods are super popular in computer vision applications, but they are just as powerful for NLP.

In this article, we'll go through all the major data augmentation methods for NLP that you can use to increase the size of your textual dataset and improve your model performance.

## Data augmentation for computer vision vs NLP

In computer vision applications, data augmentation is done almost everywhere to get more training data and make the model generalize better. The main methods used involve:

- cropping
- flipping
- zooming
- rotation
- noise injection

In computer vision, these transformations are done on the go using data generators. As a batch of data is fed to your neural network, it is randomly transformed (augmented). You don't need to prepare anything before training.

This isn't the case with NLP, where data augmentation should be done carefully due to the grammatical structure of the text. The methods discussed here are used before training: a new, augmented dataset is generated beforehand and later fed into data loaders to train the model.

## Data Augmentation Methods

In this article, I will mainly focus on the NLP data augmentation methods provided in the following projects:

- Back translation
- EDA (Easy Data Augmentation)
- NLP Albumentation
- NLPAug

So, let's dive into each of them.

## Back translation

In this method, we translate the text data to some language and then translate it back to the original language. This can help generate textual data with different words while preserving the context of the original. Machine translation APIs like Google Translate, Bing, and Yandex are used to perform the translation. For example:

*(Example of back translation | Source: Amit Chaudhary, "Back Translation for Text Augmentation with Google Sheets")*

You can see that the back-translated sentences are not identical to the originals, but their content remains the same. If you want to try this method on a dataset, you can use this notebook as a reference. A minimal sketch of the idea follows.
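As a rough illustration (this is my own sketch, not the referenced notebook's code), the NLPAug library covered later in this article ships a back-translation augmenter built on pre-trained translation models. The model names below are the commonly used English-German pair; treat them as an assumption, since any compatible model pair works:

```python
import nlpaug.augmenter.word as naw

# Back translation: English -> German -> English.
# The two model names are assumed defaults; any compatible
# translation model pair from the Hugging Face hub works.
back_translation = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de',
    to_model_name='facebook/wmt19-de-en',
)

text = 'The quick brown fox jumps over the lazy dog'
# Depending on the library version, augment() may return a string or a list.
augmented = back_translation.augment(text)
print(augmented)  # a paraphrase with the same meaning, different words
```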
## Easy Data Augmentation

Easy Data Augmentation (EDA) uses traditional, very simple data augmentation methods. EDA consists of four simple operations that do a surprisingly good job of preventing overfitting and helping train more robust models.

### Synonym Replacement

Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

For example, given the sentence:

> This article will focus on summarizing data augmentation techniques in NLP.

The method randomly selects *n* words (say two), here *article* and *techniques*, and replaces them with *write-up* and *methods* respectively:

> This write-up will focus on summarizing data augmentation methods in NLP.

### Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.

For example, given the sentence:

> This article will focus on summarizing data augmentation techniques in NLP.

The method randomly selects *n* words (say two), here *article* and *techniques*, finds their synonyms *write-up* and *methods*, and inserts these synonyms at random positions in the sentence:

> This article will focus on write-up summarizing data augmentation techniques in NLP methods.

### Random Swap

Randomly choose two words in the sentence and swap their positions. Do this *n* times.

For example, given the sentence:

> This article will focus on summarizing data augmentation techniques in NLP.

The method randomly selects two words, here *article* and *techniques*, and swaps them to create a new sentence:

> This techniques will focus on summarizing data augmentation article in NLP.

### Random Deletion

Randomly remove each word in the sentence with probability *p*.

For example, given the sentence:

> This article will focus on summarizing data augmentation techniques in NLP.

Here the words *will* and *techniques* happened to be removed from the sentence:

> This article focus on summarizing data augmentation in NLP.

You can go to this repository if you want to apply these techniques to your projects. A minimal sketch of two of these operations is shown below.
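These operations are simple enough to sketch in a few lines of plain Python. This is an illustrative sketch, not the EDA authors' reference implementation (see their repository for that):

```python
import random

def random_deletion(words, p=0.1):
    """Drop each word independently with probability p."""
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    # If everything was dropped, keep one word so the sample isn't empty.
    return kept if kept else [random.choice(words)]

def random_swap(words, n=2):
    """Swap the positions of two randomly chosen words, n times."""
    words = words[:]
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

sentence = "This article will focus on summarizing data augmentation techniques in NLP."
tokens = sentence.split()
print(" ".join(random_deletion(tokens, p=0.2)))
print(" ".join(random_swap(tokens, n=2)))
```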
## NLP Albumentation

Previously, we talked about the differences between computer vision and NLP data augmentation. In this section, we will see how some of the ideas used in CV data augmentation can be applied to text. For that, we will use the Albumentations package. Let's take a look at a couple of the techniques here.

### Shuffle Sentences Transform

In this transformation, if the given text sample contains multiple sentences, these sentences are shuffled to create a new sample. For example:

`text = '<Sentence1>. <Sentence2>. <Sentence3>. <Sentence4>. <Sentence5>.'`

is transformed to:

`text = '<Sentence2>. <Sentence3>. <Sentence1>. <Sentence5>. <Sentence4>.'`

### Exclude Duplicate Transform

In this transformation, if the given text sample contains duplicate sentences, these duplicates are removed to create a new sample. For example, given the sample:

`text = '<Sentence1>. <Sentence2>. <Sentence4>. <Sentence4>. <Sentence5>. <Sentence5>.'`

we transform it to:

`text = '<Sentence1>. <Sentence2>. <Sentence4>. <Sentence5>.'`

There are many other transformations you can try with this library. You can check this wonderful notebook to see the complete implementation.

## NLPAug Library

Until now, we have discussed many methods by which data augmentation can be used in NLP, but effectively implementing them from scratch is a lot of work. In this section, I will introduce you to NLPAug, a Python package that helps you augment text for your machine learning projects: it lets you apply all of these augmentations easily, and you can tune the level of augmentation using various arguments. Let's see how we can use this library to perform data augmentation.

NLPAug offers three types of augmentation:

- character level
- word level
- sentence level

At each of these levels, NLPAug provides the methods discussed in the previous sections, such as:

- random deletion
- random insertion
- shuffling
- synonym replacement

From my experience, the most commonly used and effective technique is synonym replacement via word embeddings. We replace *n* words with their synonyms (words whose embeddings are close to theirs) to obtain a sentence with the same meaning but different words.

When performing synonym replacement, we can choose which pre-trained embeddings to use for finding the synonyms of a given word. With NLPAug we can choose non-contextual embeddings like:

- GloVe
- word2vec

or contextual embeddings like:

- BERT
- RoBERTa

For example:

```python
import nlpaug.augmenter.word as naw

text = 'The quick brown fox jumps over the lazy dog'

aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased',
    action='insert',
)
augmented_text = aug.augment(text)
```

Original: *The quick brown fox jumps over the lazy dog*
Augmented: *even the quick brown fox usually jumps over the lazy dog*

## Things to keep in mind while doing NLP Data Augmentation

As I said in the introduction, there are certain things we need to be careful of when doing augmentation in NLP. The main issue faced when training on augmented data is that, when augmentation is done incorrectly, algorithms heavily overfit the augmented training data.

Some things to keep in mind:

- Do not validate using the augmented data.
- If you're doing K-fold cross-validation, always keep the original sample and its augmented versions in the same fold to avoid overfitting.
- Always try different augmentation approaches and check which works better.
- A mix of different augmentation methods can also help, but don't overdo it.
- Experiment to determine the optimal number of samples to augment for the best results.
- Keep in mind that data augmentation in NLP does not always help to improve model performance.

## Data Augmentation workflow

In this section, we will try data augmentation on the Real or Not? NLP with Disaster Tweets competition hosted on Kaggle. In one of my previous posts, I used the data from this competition to try different non-contextual embedding methods. Here, I will use the very same classification pipeline, but add data augmentation to see if it improves model performance.

First, let's load the training dataset and check the target class distribution:

```python
x = tweet.target.value_counts()
sns.barplot(x.index, x)
plt.gca().set_ylabel('samples')
```

We can see that there is a small class imbalance here. Let's generate some positive samples using the synonym replacement method.

Before data augmentation, we split the data into train and validation sets, so that no sample in the validation set has been used for data augmentation:

```python
train, valid = train_test_split(tweet, test_size=0.15)
```

Now we can augment the training dataset. I have chosen to generate 300 samples from the positive class:

```python
def augment_text(df, samples=300, pr=0.2):
    aug_w2v.aug_p = pr
    new_text = []

    ## selecting the minority class samples
    df_n = df[df.target == 1].reset_index(drop=True)

    ## data augmentation loop
    for i in tqdm(np.random.randint(0, len(df_n), samples)):
        text = df_n.iloc[i]['text']
        augmented_text = aug_w2v.augment(text)
        new_text.append(augmented_text)

    ## dataframe
    new = pd.DataFrame({'text': new_text, 'target': 1})
    df = shuffle(df.append(new).reset_index(drop=True))
    return df

train = augment_text(train)
```

We can now use this augmented text data to train the model.

So, did data augmentation with synonym replacement work? With data augmentation, we got a good boost in model performance (AUC). Playing with different techniques and tuning the hyperparameters of the data augmentation methods can improve results even further, but I will leave that for now. If you'd like to do that, I prepared a notebook where you can play with things.
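For reference, the `aug_w2v` object used in `augment_text` above is a word-embedding-based synonym replacement augmenter. Here is a hedged sketch of how it might be set up with NLPAug and how you could sweep the augmentation probability `aug_p`; the model path is a placeholder, and the training/evaluation step is elided:

```python
import nlpaug.augmenter.word as naw

# Word2vec-based synonym replacement, as used in the workflow above.
# The model path is a placeholder; point it at a local word2vec binary.
aug_w2v = naw.WordEmbsAug(
    model_type='word2vec',
    model_path='GoogleNews-vectors-negative300.bin',
    action='substitute',
)

# Sweep the augmentation probability and compare validation AUC
# to find the augmentation strength that helps the most.
for pr in [0.1, 0.2, 0.3]:
    augmented_train = augment_text(train.copy(), samples=300, pr=pr)
    # ... train on augmented_train, evaluate AUC on valid ...
```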
## Final thoughts

In this article, we discussed and implemented different data augmentation methods for textual data. To my knowledge, these are the best publicly available techniques and packages for the task. Hopefully, you will find them useful in your projects.

*This article was originally written by Shahul ES and posted on the Neptune blog, where you can find more in-depth articles for machine learning practitioners.*