
Unsupervised Data Augmentation

by Edward Ma, August 5th, 2019

Too Long; Didn't Read

Unsupervised Data Augmentation (UDA) helps us build a better model by leveraging several data augmentation methods. In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language, while generating augmented images in computer vision is relatively easier. Back translation is a method that leverages a translation system to generate data. After generating a large enough data set for model training, the authors noticed that the model can easily over-fit, so they introduce Training Signal Annealing (TSA) to overcome it.

The more data we have, the better the performance we can achieve. However, annotating a large amount of training data is very expensive. Therefore, proper data augmentation is useful for boosting your model's performance. Xie et al. (2019) proposed Unsupervised Data Augmentation (UDA), which helps us build a better model by leveraging several data augmentation methods.

In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language. Not every word can be replaced by another (e.g. a, an, the), and not every word has a synonym. Even changing a single word can make the context totally different. On the other hand, generating augmented images in the computer vision area is relatively easier: even after introducing noise or cropping out a portion of the image, the model can still classify it.

Xie et al. conducted several data augmentation experiments on image classification (AutoAugment) and text classification (back translation and TF-IDF based word replacement). After generating a large enough data set for model training, the authors noticed that the model can easily over-fit. Therefore, they introduced Training Signal Annealing (TSA) to overcome it.

Augmentation Strategies

This section introduces three data augmentation methods across the computer vision (CV) and natural language processing (NLP) fields.

AutoAugment for Image Classification

AutoAugment was introduced by Google in 2018. It is a way to augment images automatically. Unlike traditional image augmentation libraries, AutoAugment is designed to find the best policy for manipulating data automatically.

You may visit here for model and implementation.
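Below is a minimal sketch of applying a learned AutoAugment policy. It uses torchvision's built-in implementation (available since torchvision 0.10) rather than the original Google code, and the image path is a placeholder.

```python
from PIL import Image
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# AutoAugment samples a sequence of operations (shear, rotate, color shifts, ...)
# from a policy learned via search; here we use the policy found on ImageNet.
augmenter = AutoAugment(policy=AutoAugmentPolicy.IMAGENET)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
augmented = augmenter(image)  # returns a randomly transformed PIL image
augmented.save("example_augmented.jpg")
```

Because the policy is stochastic, calling the augmenter repeatedly on the same image yields different training samples.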

Back translation for Text Classification

Back translation is a method that leverages a translation system to generate data. Given that we have a model for translating English to Cantonese and vice versa, augmented data can be retrieved by translating the original data from English to Cantonese and then back to English.

Sennrich et al. (2015) used back translation to generate more training data and improve translation model performance.
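Below is a rough back-translation sketch using MarianMT checkpoints from Hugging Face transformers. The pivot language (French) and the model names are my stand-ins for the English-Cantonese pair described above; any installed translation pair works the same way.

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Load the tokenizer and translation model for one language direction.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(texts):
    # English -> French -> English; paraphrases come from translation noise.
    pivot = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(pivot, "Helsinki-NLP/opus-mt-fr-en")

print(back_translate(["Data augmentation boosts model performance."]))
```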

TF-IDF based word replacement for Text Classification

Although back translation helps generate lots of data, there is no guarantee that keywords will be kept after translation. Some keywords carry more information than others, and they may be lost in translation.

Therefore, Xie et al. use TF-IDF to tackle this limitation. The concept behind TF-IDF is that high-frequency words may not provide much information gain; in other words, rare words contribute more weight to the model. A word's importance increases with its number of occurrences within the same document (i.e. training record), and decreases as it occurs across the corpus (i.e. other training records).

The IDF score is calculated on the DBPedia corpus. A TF-IDF score is computed for each token, and tokens are replaced according to that score: tokens with a low TF-IDF score have a high probability of being replaced.
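The sketch below illustrates the idea with scikit-learn's TfidfVectorizer (1.0+). It is deliberately simplified: the mean-score threshold and the uniform sampling of substitute tokens are my own shortcuts, whereas the paper samples replacements according to TF-IDF statistics over the corpus.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was fantastic and the plot was engaging",
    "the movie was dull and the acting was poor",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # rows: documents, columns: tokens
vocab = vectorizer.get_feature_names_out()

def augment(doc_idx, replace_prob=0.3):
    scores = dict(zip(vocab, matrix[doc_idx].toarray()[0]))
    threshold = sum(scores.values()) / len(scores)  # crude cut-off
    out = []
    for word in docs[doc_idx].split():
        # Low TF-IDF words carry little information, so replace them more often.
        if scores.get(word, 0.0) < threshold and random.random() < replace_prob:
            out.append(random.choice(list(vocab)))  # simplified: uniform substitute
        else:
            out.append(word)
    return " ".join(out)

print(augment(0))
```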

If you are interested in using TF-IDF based word replacement for data augmentation, you may visit nlpaug for a Python implementation.

Training Signal Annealing (TSA)

After generating a large amount of data using the aforementioned techniques, Xie et al. noticed that the model over-fits easily. Therefore, they introduced TSA: during model training, examples with high confidence are removed from the loss function to prevent over-training.

The threshold ηt ranges from 1/K to 1 over the course of training, where K is the number of categories. If the model's predicted probability for the correct category is higher than ηt, the example is removed from the loss function.

Three schedules for computing ηt are considered for different scenarios (see the sketch after this list):

  • Linear-schedule: ηt grows at a constant rate
  • Log-schedule: ηt grows faster in the early stage of training
  • Exp-schedule: ηt grows faster at the end of training
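Here is a small sketch of the three schedules. The formula ηt = αt × (1 − 1/K) + 1/K comes from the paper; the scaling constant 5 in the log and exp schedules follows the authors' description and should be treated as an assumption.

```python
import math

def tsa_threshold(schedule, step, total_steps, num_classes):
    progress = step / total_steps
    if schedule == "linear":
        alpha = progress                      # grows at a constant rate
    elif schedule == "log":
        alpha = 1 - math.exp(-progress * 5)   # grows fast early on
    elif schedule == "exp":
        alpha = math.exp((progress - 1) * 5)  # grows fast near the end
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # eta_t rises from 1/K (random-guess probability) at step 0 to 1 at the end.
    return alpha * (1 - 1 / num_classes) + 1 / num_classes

# Thresholds halfway through training on a 10-class problem:
for schedule in ("linear", "log", "exp"):
    print(schedule, round(tsa_threshold(schedule, 5000, 10000, 10), 3))
```

During training, examples whose correct-class probability already exceeds ηt are masked out of the loss, so the model keeps learning mainly from the examples it has not yet fitted.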

Recommendation

The above approaches were designed to solve the problems the authors faced. If you understand your data, you should tailor the augmentation approach to it. Remember the golden rule of data science: garbage in, garbage out.

Like to learn?

I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or GitHub.

Reference

Q. Xie, Z. Dai, E. Hovy, M.-T. Luong and Q. V. Le. Unsupervised Data Augmentation for Consistency Training. 2019.

R. Sennrich, B. Haddow and A. Birch. Improving Neural Machine Translation Models with Monolingual Data. 2015.

E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan and Q. V. Le. AutoAugment: Learning Augmentation Policies from Data. 2018.