This blog post is the fourth one in the 5-minute Papers series. In this post, I’ll give highlights from the paper “To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks” by Matthew Peters, Sebastian Ruder, and Noah A. Smith.

This paper focuses on sharing the best methods to “adapt” your pretrained model to the target task (answering the question: “To Tune or Not to Tune?”).

Context

Transfer learning is the methodology of transferring “knowledge” from a pre-trained model. The original model might have been trained on a slightly different task than our target task. The target task is the task that we are trying to solve, for example, text classification.

Let’s consider a quick example for better understanding. Assume we are trying to perform text classification. Instead of training a model to perform just that, it has been shown that one could train a Language Model (LM) and then fine-tune it to perform text classification. How is this helpful? The “knowledge” of the English language captured by the LM carries over to the classification model. Note that this is already standard practice in Computer Vision.

Adaptation

Back to the original discussion, transfer learning has 2 steps: Pre-training -> Adaptation.

Adaptation is when we use our pre-trained model and “adapt” it to the target task. Two possible approaches to performing adaptation are:

- Feature extraction: The pre-trained features are kept frozen and these extracted features are used in the target model. (The authors have used the ❄️ emoji to denote this.)
- Fine-tuning: In this approach, the model weights are trained further on the target/new task. (The authors have used the 🔥 emoji to denote this.)

Approach

The authors compare ELMo and BERT as the base architectures since these are among the best performers.
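To make the ❄️ vs 🔥 distinction concrete, here is a minimal toy sketch in plain Python. It is not the paper's code: it assumes a made-up one-weight "encoder" standing in for the pretrained model and a one-weight "head" standing in for the target model. The names `TinyModel`, `sgd_step`, and `adapt`, and all the numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TinyModel:
    encoder_w: float  # stands in for the pretrained weights
    head_w: float     # stands in for the task-specific layer


def sgd_step(model, x, y, lr, tune_encoder):
    """One SGD step on the squared error 0.5 * (head_w * encoder_w * x - y) ** 2."""
    features = model.encoder_w * x              # "extracted features"
    error = model.head_w * features - y
    grad_head = error * features                # d(loss) / d(head_w)
    grad_encoder = error * model.head_w * x     # d(loss) / d(encoder_w)
    model.head_w -= lr * grad_head              # the head is always trained
    if tune_encoder:                            # fine-tuning: encoder is trained too
        model.encoder_w -= lr * grad_encoder
    return 0.5 * error ** 2


def adapt(tune_encoder, steps=50):
    """Adapt a fresh copy of the toy 'pretrained' model to a single toy example."""
    model = TinyModel(encoder_w=0.8, head_w=0.1)
    loss = None
    for _ in range(steps):
        loss = sgd_step(model, x=1.5, y=3.0, lr=0.1, tune_encoder=tune_encoder)
    return model, loss


frozen_model, frozen_loss = adapt(tune_encoder=False)  # feature extraction
tuned_model, tuned_loss = adapt(tune_encoder=True)     # fine-tuning

print("feature extraction:", frozen_model, frozen_loss)
print("fine-tuning:       ", tuned_model, tuned_loss)
```

In the frozen run the encoder weight stays at its pretrained value and only the head changes; in the fine-tuned run both weights move. The toy says nothing about which approach performs better on real tasks.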
ELMo and BERT are evaluated on 5 different tasks, utilizing several standard datasets, and a comparative study of feature extraction (❄️) vs fine-tuning (🔥) is discussed:

- Named entity recognition (NER)
- Sentiment analysis (SA)
- Three sentence pair tasks: Natural language inference (NLI), Paraphrase detection (PD), and Semantic textual similarity (STS)

The complete experimental details are shared in the last section of the paper. Two fine-tuning tricks worth a quick mention are:

- Discriminative learning rates: The learning rates are set differently for the layers, decreasing layer by layer as 0.4 * the learning rate of the layer above (the outer layer).
- Gradual unfreezing: Starting with the top layer, in each epoch one additional layer of weights is unfrozen until all weights are training.

Results

The results of these experiments across 7 different datasets representing different fundamental tasks are summarized in the paper's results table. The best approach for each case is marked in either blue (❄️) or red (🔥).

Conclusions and Thoughts

The paper compares the effectiveness of two specific approaches, heavily marked by emojis (Personal note: I found it pretty cool to see emojis in a paper).

The paper also shares a quick practical guide for NLP practitioners. There are also extensive analyses of several aspects:

- Modelling pairwise interactions
- Impact of additional parameters
- ELMo fine-tuning
- Impact of target domain
- Representations at different layers

Both approaches are important: feature extraction is the “cheaper”, less computationally expensive option, while fine-tuning is often the better performer because it allows better adaptation of the pretrained model.

To conclude, the paper does an extensive comparison across tasks and answers an important question. The practitioner guideline is also an amazing table for quick reference.

If you found this interesting and would like to be a part of My Learning Path, you can find me on Twitter.
If you’re interested in reading about Deep Learning and Computer Vision news, you can check out my newsletter. If you’re interested in reading some of the best advice from Machine Learning Heroes (Practitioners, Researchers, and Kagglers), check out my interview series.
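As a closing sketch, the two fine-tuning tricks mentioned above (discriminative learning rates and gradual unfreezing) can be written as two small schedule helpers in plain Python. This is an illustration under assumptions, not the authors' implementation: the function names, the top-layer learning rate, and the layer counts are hypothetical, and the 0.4 factor is applied once per layer going down from the top layer, which is my reading of the description.

```python
def discriminative_lrs(top_lr, num_layers, decay=0.4):
    """Per-layer learning rates; index 0 is the bottom layer, the last index is the top.

    Assumption: each layer uses `decay` times the rate of the layer above it.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def trainable_layers(epoch, num_layers):
    """Gradual unfreezing: indices of the layers that are trained in `epoch` (0-based).

    Epoch 0 trains only the top layer; every later epoch unfreezes one more
    layer below it, until all layers are training.
    """
    num_unfrozen = min(epoch + 1, num_layers)
    return list(range(num_layers - num_unfrozen, num_layers))


# Hypothetical setup: a 4-layer model with a top-layer learning rate of 1e-3.
layer_lrs = discriminative_lrs(top_lr=1e-3, num_layers=4)
for epoch in range(5):
    active = trainable_layers(epoch, num_layers=4)
    print(epoch, active, [layer_lrs[i] for i in active])
```

In this sketch only the top layer (index 3) is trained in the first epoch, and from the fourth epoch onward all four layers are trained, each with its own learning rate.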
