Recent years have seen a plethora of pre-trained models such as ULMFiT, BERT, GPT, etc. being open-sourced to the NLP community. Given the sheer size of these models, it's nearly impossible to train them from scratch considering the amount of data and computation required. This is where the learning paradigm of "Transfer Learning" kicks in. Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.

The idea is to take pre-trained network weights and fine-tune them for the specific task at hand. Reusing the weights this way requires the network to be trained first on a very large, high-quality corpus so that it learns language structure, grammar, and semantics. Most existing models, such as ULMFiT and GPT, were pre-trained with the Language Model objective on corpora such as Wikipedia and the Google News dataset. BERT, on the other hand, was trained with the MLM (Masked Language Model) objective. Later in this post, we will see what MLM is and how T5 is trained with a similar objective, with small tweaks for generalizability. Just to make sure everyone is on the same page, a Language Model is a Machine Learning model that looks at the preceding part of a sentence and predicts the next word.

The T5: Text-to-Text-Transfer-Transformer model proposes reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings. This formatting lets one T5 model serve multiple tasks. As the featured animation shows, the model takes text input on the left for various NLP tasks and outputs the text for the respective task. We will see more about how the model was trained in the sections below.

Before that, I want to discuss the data that was used to pre-train the model. The authors have named it C4 (Colossal Clean Crawled Corpus).
It's approximately 700GB in size and is a cleaned version of the Common Crawl dataset. The cleaning the authors describe includes extracting only English text, removing lines of code, deduplicating, etc. It's a high-quality, pre-processed English-language corpus that they have made available for download. The T5 model, pre-trained on C4, achieves state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned for a variety of important downstream tasks.

Training Objective

T5 trains with the same objective as BERT, the Masked Language Model, with a small modification. Masked Language Models are bidirectional models: at any time t, the representation of a word is derived from both its left and its right context. The subtle difference in T5 is that it replaces multiple consecutive tokens with a single mask keyword, unlike BERT, which uses a mask token for each word. As you can see from the above diagram, the original text is transformed into input and output pairs by adding perturbations to it. Since the final goal is a model that takes text in and puts text out, the targets were designed to be sequences, unlike BERT, which tries to output one word (the masked word itself) through a final feed-forward layer and softmax at the output level.

The model was trained on the C4 corpus (mentioned above) with this objective as part of pre-training. It was then fine-tuned on various tasks such as Language Translation, Summarization, Sentence Similarity, etc. Fine-tuning was done by showing the model input/output text pairs with a task-specific prefix added to each input, for example "translate English to German: <text>". Adding such a prefix lets the model tune its weights for the particular task at hand and produce only the expected output for that task by narrowing its scope of generation. All the tasks essentially share the same objective, training procedure, and decoding process.
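To make the two kinds of text pairs described above a bit more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names `corrupt_spans` and `add_task_prefix` are hypothetical, the spans are picked by hand instead of sampled, whitespace splitting stands in for real tokenization, and the sentinel names `<X>`, `<Y>`, `<Z>` are simply placeholders for the "mask keyword".

```python
from typing import List, Tuple

# Placeholder mask keywords (an assumption for illustration only).
SENTINELS = ["<X>", "<Y>", "<Z>"]


def corrupt_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Build an (input, target) pair by masking whole spans.

    `spans` holds half-open (start, end) token ranges that are sorted and
    non-overlapping. Each span is replaced by ONE sentinel in the input,
    and the target lists every sentinel followed by the tokens it hides.
    """
    if len(spans) >= len(SENTINELS):
        raise ValueError("not enough sentinels for the requested spans")
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        inputs.extend(tokens[cursor:start])  # keep the untouched text
        inputs.append(sentinel)              # one mask for the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the text the model must produce
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(SENTINELS[len(spans)])    # closing sentinel ends the target
    return " ".join(inputs), " ".join(targets)


def add_task_prefix(prefix: str, text: str) -> str:
    """Format a fine-tuning input as '<task prefix>: <text>'."""
    return f"{prefix}: {text}"


tokens = "Thank you for inviting me to your party last week.".split()
model_input, model_target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(model_input)   # Thank you <X> me to your party <Y> week.
print(model_target)  # <X> for inviting <Y> last <Z>
print(add_task_prefix("translate English to German", "That is good."))
```

Note that both the pre-training pair and the prefixed fine-tuning input are ordinary strings, which is what allows a single model and a single training procedure to cover every task.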
The authors also claim that they did not find a single case where the model got confused and output something totally random or the expected output of another task. One quite interesting thing is that they even modeled regression tasks such as sentence similarity as a text generation objective. To reduce the scope of real numbers, they generated a number between 0 and 5 with 0.2 quantization, which means the model could only produce numbers at 0.2 intervals, for example 3.2, 3.4, 3.6, etc. Training-level specifics such as the LR schedule, tokenization, sequence length, etc. can be read in detail in Section 3.1.2 (Training) of the paper.

The authors conducted extensive hyper-parameter tuning and testing across various tasks. The diagram below shows the tuning at various levels:

1. Pre-training Style - They tried a typical auto-regressive language modeling objective, a BERT-style Masked Language Model objective, and a deshuffling denoising objective. They found the BERT style (missing context prediction) to be the best bet for pre-training the model.
2. Corruption Scheme - They experimented with 3 types of corruption strategy: masking a random word, masking a span (more than 1 consecutive word), and dropping a word from the input. Given the task type at hand, where both input and output are text strings, corrupting spans worked best for them.
3. Corruption Rate - After playing around with different corruption rates, they found all of them to perform almost the same, with 15% being a little better.
4. Corruption Length - They also experimented with different corruption span lengths and found that the longer the span, the worse the model performed. This seems reasonable: a span as long as the sentence would mean the model has to produce the text from an empty input, leaving it free to generate highly variable output.

I would also encourage the reader to read the Reflection section of the paper (page 32) to understand the takeaways from training the model.
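The regression-as-text trick mentioned earlier in this section can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `score_to_target` and `target_to_score` are hypothetical, and returning `None` for unparseable output is a choice made for this sketch, not something taken from the paper.

```python
from typing import Optional


def score_to_target(score: float, step: float = 0.2) -> str:
    """Round a similarity score to the nearest multiple of `step` and
    render it as the text string the model is trained to generate."""
    quantized = round(score / step) * step
    # One decimal place is enough for multiples of 0.2.
    return f"{quantized:.1f}"


def target_to_score(text: str) -> Optional[float]:
    """Parse generated text back into a number; None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


print(score_to_target(3.27))  # 3.2
print(score_to_target(3.33))  # 3.4
print(target_to_score("3.4"))
print(target_to_score("not a number"))
```

With a step of 0.2, the model only ever has to choose among a small, fixed set of strings, so the regression task effectively becomes a classification over text labels.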
Demo

This section focuses on running inference with the pre-trained T5 model. All the code has been committed to Github: Text-to-Text-Transfer-Transformer. Feel free to clone it and play around. Also, don't forget to star the repo if you liked it.

References

Google AI Blog - T5
https://arxiv.org/pdf/1910.10683.pdf (see Page 32: Reflection)

Thanks for your time.

Previously published at https://prakhartechviz.blogspot.com/2020/04/t5-text-to-text-transfer-transformer.html

Exploring T5 Model : Text to Text Transfer Transformer Model
