Transfer Learning for Natural Language Processing

Written by pranoyradhakrishnan | Published 2018/02/17
Tech Story Tags: machine-learning | naturallanguageprocessing | artificial-intelligence | transfer-learning | nlp


Transfer learning aims to use valuable knowledge learned in a source domain to improve model performance in a target domain.

Why do we need Transfer Learning for NLP?

In NLP applications, we often do not have a dataset large enough for the task we want to solve (called the target task T). In that case, we would like to transfer knowledge from other tasks S to avoid overfitting and to improve performance on T.

Two Scenarios

Transferring knowledge to a semantically similar or identical task, but with a different dataset.

  • Source task (S) - A large dataset for binary sentiment classification
  • Target task (T) - A small dataset for binary sentiment classification

Transferring knowledge to a task that is semantically different but shares the same neural network architecture so that neural parameters can be transferred.

  • Source task (S) - A large dataset for binary sentiment classification
  • Target task (T) - A small dataset for 6-way question classification (e.g., location, time, and number)
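To make the second scenario concrete, here is a minimal sketch of what "transferring neural parameters" between two networks with the same architecture could look like. Everything in it is an assumption made for illustration: the layer names (`embedding`, `hidden`, `output`), the layer sizes, and the choice to copy every layer except the output layer, which cannot be copied directly here because the source task has 2 classes and the target task has 6. The source network is only randomly initialized and stands in for one that has already been trained on S.

```python
import random

def init_params(vocab_size, embed_dim, hidden_dim, num_classes, rng):
    """Randomly initialize a toy network as a dict of weight matrices (lists of rows)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "embedding": matrix(vocab_size, embed_dim),
        "hidden": matrix(embed_dim, hidden_dim),
        "output": matrix(hidden_dim, num_classes),
    }

def transfer(source_params, target_params, layers):
    """Copy the named layers from the source network into the target network."""
    for name in layers:
        target_params[name] = [row[:] for row in source_params[name]]
    return target_params

rng = random.Random(0)
# Hypothetical sizes; source_net stands in for a network already trained on S.
source_net = init_params(100, 8, 16, num_classes=2, rng=rng)  # binary sentiment
target_net = init_params(100, 8, 16, num_classes=6, rng=rng)  # 6-way question classification

# The output layer keeps its own random initialization because its shape differs.
target_net = transfer(source_net, target_net, layers=["embedding", "hidden"])
```

Which layers are worth transferring is a design choice, and the list passed to `transfer` is the place to experiment with it.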

Transfer Methods

Parameter initialization (INIT)

The INIT approach first trains the network on S, and then directly uses the tuned parameters to initialize the network for T. After transfer, we may either fix (freeze) the parameters in the target domain, so that no training is performed on T, or fine-tune them on T when labeled target data are available.
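The INIT recipe can be sketched with a tiny model. The example below is only an illustration under stated assumptions: it uses a pure-Python logistic regression instead of a neural network, a synthetic toy task (`make_data`) instead of real sentiment data, and hypothetical hyperparameters. What it is meant to show is the order of operations: train on S, copy the parameters, then either freeze them or keep training on T.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, init_w, epochs, lr):
    """Logistic regression trained with SGD, starting from init_w."""
    w = list(init_w)  # copy so the caller's parameters are not modified
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            for i, xi in enumerate(x):
                w[i] -= lr * error * xi
    return w

def accuracy(data, w):
    return sum((predict(w, x) >= 0.5) == (y == 1) for x, y in data) / len(data)

def make_data(n, rng):
    """Toy binary task (an assumption): label is 1 when x0 + x1 > 0; last feature is a bias."""
    rows = []
    for _ in range(n):
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(([x0, x1, 1.0], 1 if x0 + x1 > 0 else 0))
    return rows

rng = random.Random(0)
source = make_data(500, rng)  # large source dataset S
target = make_data(10, rng)   # small target dataset T

# Step 1: train on S from a zero initialization.
w_source = train(source, [0.0, 0.0, 0.0], epochs=20, lr=0.1)

# Step 2a: freeze, i.e. reuse the transferred parameters on T without training.
w_frozen = list(w_source)

# Step 2b: fine-tune, i.e. continue training on T from the transferred parameters.
w_finetuned = train(target, w_source, epochs=5, lr=0.1)

print("frozen accuracy on T:    ", accuracy(target, w_frozen))
print("fine-tuned accuracy on T:", accuracy(target, w_finetuned))
```

In a real network the same choice is usually made per layer, so some layers may be frozen while others are fine-tuned.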

Multi-task learning (MULT)

MULT, on the other hand, trains on samples from both domains simultaneously.

Figure: Multi-task learning
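One simple way to train on both domains at once is to pick each training sample from T with some probability and from S otherwise, and to update one shared set of parameters. The sketch below assumes exactly that; the mixing probability `lam`, the toy data, and the logistic-regression model are all assumptions for illustration, not something taken from the text above. In expectation, each update follows the gradient of lam * J_T + (1 - lam) * J_S, where J_T and J_S are the average losses on the two datasets.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_step(w, x, y, lr):
    """One in-place SGD update of the shared weights on a single sample."""
    error = predict(w, x) - y
    for i, xi in enumerate(x):
        w[i] -= lr * error * xi

def train_mult(source, target, lam, steps, lr, seed=0):
    """Multi-task training: sample from T with probability lam, else from S."""
    rng = random.Random(seed)
    w = [0.0] * len(source[0][0])  # parameters shared by both tasks
    for _ in range(steps):
        data = target if rng.random() < lam else source
        x, y = rng.choice(data)
        sgd_step(w, x, y, lr)
    return w

def accuracy(data, w):
    return sum((predict(w, x) >= 0.5) == (y == 1) for x, y in data) / len(data)

def make_data(n, rng):
    """Toy binary task (an assumption): label is 1 when x0 + x1 > 0; last feature is a bias."""
    rows = []
    for _ in range(n):
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(([x0, x1, 1.0], 1 if x0 + x1 > 0 else 0))
    return rows

rng = random.Random(0)
source = make_data(500, rng)  # large source dataset S
target = make_data(10, rng)   # small target dataset T

w_shared = train_mult(source, target, lam=0.3, steps=5000, lr=0.1)
print("accuracy on T:", accuracy(target, w_shared))
```

Because both toy tasks are the same binary problem, every parameter is shared here. When the tasks have different label sets, each task would need its own output layer on top of the shared part.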

Combination (MULT+INIT)

We first pretrain on the source domain S for parameter initialization, and then train on S and T simultaneously.
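The combined recipe is two phases run back to back. As a rough sketch, again assuming a pure-Python logistic regression, synthetic data, and hypothetical hyperparameters (`lam`, `steps`, `lr`), phase 1 pretrains on S only, and phase 2 continues from those parameters while sampling from both S and T.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_step(w, x, y, lr):
    """One in-place SGD update on a single sample."""
    error = predict(w, x) - y
    for i, xi in enumerate(x):
        w[i] -= lr * error * xi

def pretrain(source, epochs, lr):
    """Phase 1 (INIT): train on the source dataset only."""
    w = [0.0] * len(source[0][0])
    for _ in range(epochs):
        for x, y in source:
            sgd_step(w, x, y, lr)
    return w

def joint_train(init_w, source, target, lam, steps, lr, seed=0):
    """Phase 2 (MULT): start from init_w, sample from T with probability lam, else S."""
    rng = random.Random(seed)
    w = list(init_w)  # copy so the pretrained parameters stay available
    for _ in range(steps):
        data = target if rng.random() < lam else source
        x, y = rng.choice(data)
        sgd_step(w, x, y, lr)
    return w

def accuracy(data, w):
    return sum((predict(w, x) >= 0.5) == (y == 1) for x, y in data) / len(data)

def make_data(n, rng):
    """Toy binary task (an assumption): label is 1 when x0 + x1 > 0; last feature is a bias."""
    rows = []
    for _ in range(n):
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(([x0, x1, 1.0], 1 if x0 + x1 > 0 else 0))
    return rows

rng = random.Random(0)
source = make_data(500, rng)  # large source dataset S
target = make_data(10, rng)   # small target dataset T

w_init = pretrain(source, epochs=20, lr=0.1)
w_final = joint_train(w_init, source, target, lam=0.3, steps=2000, lr=0.1)
print("accuracy on T:", accuracy(target, w_final))
```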

Model Performance of INIT, MULT and MULT+INIT

  • Transfer learning between semantically equivalent tasks appears to be successful.
  • For semantically different tasks, there is no large improvement.

Conclusion

The success of neural transfer learning in NLP depends largely on how semantically similar the source and target tasks are.

Reference

https://arxiv.org/abs/1603.06111


Published by HackerNoon on 2018/02/17