
The Ten Must Read NLP/NLU Papers from the ICLR 2020 Conference

by neptune.ai's Jakub Czakon, July 24th, 2020

The International Conference on Learning Representations (ICLR) took place last week, and I had the pleasure of participating in it. ICLR is an event dedicated to research on all aspects of representation learning, commonly known as deep learning. This year the event was a bit different, as it went virtual. However, the online format didn't change the great atmosphere: the conference was engaging and interactive, and it attracted 5,600 attendees, twice as many as last year. If you're interested in what the organizers think about the unusual online arrangement of the conference, you can read about it here.

Over 1,300 speakers presented many interesting papers, so I decided to create a series of blog posts summarizing the best of them in four main areas. You can catch up with the first post on the best deep learning papers here, the second on reinforcement learning papers here, and the third on generative models here.

Here, I want to share more information about the 10 best Natural Language Processing/Understanding (NLP/NLU) contributions from ICLR:

  1. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  2. A Mutual Information Maximization Perspective of Language Representation Learning
  3. Mogrifier LSTM
  4. High Fidelity Speech Synthesis with Adversarial Networks
  5. Reformer: The Efficient Transformer
  6. DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling
  7. Depth-Adaptive Transformer
  8. On Identifiability in Transformers
  9. Mirror-Generative Neural Machine Translation
  10. FreeLB: Enhanced Adversarial Training for Natural Language Understanding

1. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

A new pretraining method that establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
(TL;DR, from OpenReview.net)

Paper | Code

The L2 distances and cosine similarities (in terms of degree) of the input and output embeddings of each layer for BERT-large and ALBERT-large. (source: figure 1 from the paper)
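
To make the parameter savings concrete, here is a minimal, hedged sketch of the two parameter-reduction ideas ALBERT is known for: a factorized embedding parameterization and cross-layer parameter sharing. The dimensions, layer choices, and names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_layers = 30000, 128, 768, 12

# Factorized embeddings: V x E + E x H parameters instead of V x H.
token_embedding = nn.Embedding(vocab_size, embed_dim)
embed_to_hidden = nn.Linear(embed_dim, hidden_dim)

# Cross-layer parameter sharing: one encoder layer reused at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)

def encode(token_ids: torch.Tensor) -> torch.Tensor:
    x = embed_to_hidden(token_embedding(token_ids))
    for _ in range(num_layers):
        x = shared_layer(x)          # same weights applied at every layer
    return x

hidden_states = encode(torch.randint(0, vocab_size, (1, 16)))
print(hidden_states.shape)           # torch.Size([1, 16, 768])
```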

First author: Zhenzhong Lan (LinkedIn)

2. A Mutual Information Maximization Perspective of Language Representation Learning

Learning word representations is a common task in NLP. Here, the authors formulate a new framework that combines classical word embedding techniques (like Skip-gram) with more modern approaches based on contextual embeddings (BERT, XLNet).

Paper

The left plot shows F1 scores of BERT-NCE and INFOWORD as we increase the percentage of training examples on SQuAD (dev). The right plot shows F1 scores of INFOWORD on SQuAD (dev) as a function of λ_DIM. (source: figure 1 from the paper)
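
As a rough illustration of this unifying view, below is a hedged sketch of an InfoNCE-style contrastive objective: a lower bound on the mutual information between a context representation and a target representation, with the other examples in the batch acting as negatives. The encoders, batch construction, and dimensions are placeholder assumptions, not the paper's exact models.

```python
import torch
import torch.nn.functional as F

def infonce_loss(context_vecs: torch.Tensor, target_vecs: torch.Tensor) -> torch.Tensor:
    """context_vecs, target_vecs: (batch, dim); row i of each forms the positive pair."""
    logits = context_vecs @ target_vecs.t()        # (batch, batch) similarity scores
    labels = torch.arange(context_vecs.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)         # the other rows act as negatives

loss = infonce_loss(torch.randn(8, 256), torch.randn(8, 256))
```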

First author: Lingpeng Kong

Twitter | GitHub | Website

3. Mogrifier LSTM

An LSTM extension with state-of-the-art language modelling results.
(TL;DR, from OpenReview.net)

Paper

Mogrifier with 5 rounds of updates. The previous state h^0 = h_prev is transformed linearly (dashed arrows), fed through a sigmoid, and gates x^{-1} = x in an elementwise manner, producing x^1. Conversely, the linearly transformed x^1 gates h^0 and produces h^2. After a number of repetitions of this mutual gating cycle, the last values of the h and x sequences are fed to an LSTM cell. The prev subscript of h is omitted to reduce clutter. (source: figure 1 from the paper)
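
The gating cycle in the caption translates fairly directly into code. Below is a minimal sketch, assuming a plain LSTMCell and simple linear gating layers; the naming and initialization details are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MogrifierLSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, rounds: int = 5):
        super().__init__()
        self.rounds = rounds
        self.lstm = nn.LSTMCell(input_dim, hidden_dim)
        # Odd rounds: h gates x (hidden_dim -> input_dim).
        self.q = nn.ModuleList([nn.Linear(hidden_dim, input_dim)
                                for _ in range((rounds + 1) // 2)])
        # Even rounds: x gates h (input_dim -> hidden_dim).
        self.r = nn.ModuleList([nn.Linear(input_dim, hidden_dim)
                                for _ in range(rounds // 2)])

    def forward(self, x, state):
        h, c = state
        qi = ri = 0
        for i in range(1, self.rounds + 1):
            if i % 2 == 1:                       # h gates x
                x = 2 * torch.sigmoid(self.q[qi](h)) * x
                qi += 1
            else:                                # x gates h
                h = 2 * torch.sigmoid(self.r[ri](x)) * h
                ri += 1
        return self.lstm(x, (h, c))              # standard LSTM cell on the mogrified pair

cell = MogrifierLSTMCell(input_dim=64, hidden_dim=128)
h, c = cell(torch.randn(4, 64), (torch.zeros(4, 128), torch.zeros(4, 128)))
```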

First author: Gábor Melis

Twitter | LinkedIn | GitHub | Website

4. High Fidelity Speech Synthesis with Adversarial Networks

We introduce GAN-TTS, a Generative Adversarial Network for Text-to-Speech, which achieves Mean Opinion Score (MOS) 4.2.
(TL;DR, from OpenReview.net)

Paper | Code

Residual blocks used in the model. Convolutional layers have the same number of input and output channels and no dilation unless stated otherwise. h denotes the hidden-layer representation, l the linguistic features, z the noise vector, and m the channel multiplier, with m = 2 for downsampling blocks (i.e. if their downsample factor is greater than 1) and m = 1 otherwise. M denotes G's input channels, with M = 2N in blocks 3, 6, 7 and M = N otherwise; size refers to kernel size. (source: figure 1 from the paper)
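
For orientation, here is a hedged sketch of a dilated-convolution residual block of the general shape the caption describes. It is not the actual GAN-TTS block: conditioning on linguistic features, upsampling/downsampling, and the channel-multiplier logic are omitted, and all sizes are made up.

```python
import torch
import torch.nn as nn

class ResidualBlock1d(nn.Module):
    """A generic dilated 1D residual block; a simplified stand-in, not the paper's block."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2      # keep the time dimension unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the block refines x instead of replacing it.
        return x + self.conv2(self.act(self.conv1(self.act(x))))

block = ResidualBlock1d(channels=64, dilation=2)
out = block(torch.randn(1, 64, 400))                 # (batch, channels, time)
```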

First author: Mikołaj Bińkowski

LinkedIn | GitHub

5. Reformer: The Efficient Transformer

Efficient Transformer with locality-sensitive hashing and reversible layers
(TL;DR, from OpenReview.net)

Paper | Code

An angular locality-sensitive hash uses random rotations of spherically projected points to establish buckets by an argmax over signed axis projections. In this highly simplified 2D depiction, the two points x and y are unlikely to share the same hash buckets (above) for the three different angular hashes unless their spherical projections are close to one another (below). (source: figure 1 from the paper)
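
The bucketing scheme in the figure can be sketched in a few lines: apply a random rotation and take the argmax over the signed axes. The sketch below follows that description; the bucket count, dimensions, and the way buckets would then restrict attention are illustrative assumptions.

```python
import numpy as np

def angular_lsh_buckets(x: np.ndarray, n_buckets: int, rng: np.random.Generator) -> np.ndarray:
    """x: (n_points, dim). Returns one bucket id in [0, n_buckets) per point."""
    assert n_buckets % 2 == 0
    # Random rotation to n_buckets / 2 dimensions; [p, -p] gives the signed axes.
    rotation = rng.standard_normal((x.shape[1], n_buckets // 2))
    projected = x @ rotation                                   # (n_points, n_buckets / 2)
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

rng = np.random.default_rng(0)
points = rng.standard_normal((10, 64))
buckets = angular_lsh_buckets(points, n_buckets=8, rng=rng)
# Points with a small angle between them tend to share a bucket, so full attention
# can be replaced by attention within each bucket.
print(buckets)
```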

Main authors:

Nikita Kitaev

LinkedIn | GitHub | Website


Łukasz Kaiser

Twitter | LinkedIn | GitHub

6. DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

DeFINE uses a deep, hierarchical, sparse network with new skip connections to learn better word embeddings efficiently.
(TL;DR, from OpenReview.net)

Paper

With DeFINE, Transformer-XL learns input (embedding) and output (classification) representations in a low n-dimensional space rather than a high m-dimensional space, reducing the number of parameters significantly while having a minimal impact on performance. (source: figure 1 from the paper)
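
Here is a hedged sketch of the core idea in the figure: keep the token embedding table in a small n-dimensional space and map it into the model's m-dimensional space with a learned transformation. DeFINE itself uses a deep, hierarchical, sparse network with skip connections for that mapping; the single-layer stand-in and all sizes below are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, n_dim, m_dim = 50000, 128, 1024

low_dim_embedding = nn.Embedding(vocab_size, n_dim)   # V x n parameters instead of V x m
define_mapping = nn.Sequential(                        # single-layer stand-in for DeFINE's deep mapping
    nn.Linear(n_dim, m_dim),
    nn.ReLU(),
)

tokens = torch.randint(0, vocab_size, (2, 32))
model_inputs = define_mapping(low_dim_embedding(tokens))   # (2, 32, m_dim), fed to the sequence model
```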

First author: 

Twitter | LinkedIn | GitHub | Website

7. Depth-Adaptive Transformer

Sequence model that dynamically adjusts the amount of computation for each input.
(TL;DR, from OpenReview.net)

Paper

Training regimes for decoder networks able to emit outputs at any layer. Aligned training optimizes all output classifiers C_n simultaneously, assuming all previous hidden states for the current layer are available. Mixed training samples M paths of random exits at which the model is assumed to have exited; missing previous hidden states are copied from below. (source: figure 1 from the paper)
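
To illustrate the "exit at any layer" idea, here is a minimal sketch in which every layer has its own output classifier and inference halts as soon as the current classifier is confident enough. The confidence-threshold halting rule, the encoder-layer stand-in for a decoder, and all sizes are simplifying assumptions, not the paper's exact training regimes.

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size, num_layers = 512, 10000, 6

# One output classifier per layer (encoder layers stand in for decoder layers here).
layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
    for _ in range(num_layers)
])
classifiers = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(num_layers)])

def adaptive_forward(x: torch.Tensor, confidence_threshold: float = 0.9):
    """Runs the layers one by one and exits once a classifier is confident enough."""
    for depth, (layer, classifier) in enumerate(zip(layers, classifiers), start=1):
        x = layer(x)
        probs = torch.softmax(classifier(x), dim=-1)
        if probs.max() >= confidence_threshold:       # confident enough: stop computing
            return probs, depth
    return probs, num_layers                          # otherwise use the full stack

probs, exit_depth = adaptive_forward(torch.randn(1, 1, hidden_dim))
print(exit_depth)
```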

First author: Maha Elbayad

Twitter | LinkedIn | GitHub | Website

8. On Identifiability in Transformers

We investigate the identifiability and interpretability of attention distributions and tokens within contextual embeddings in the self-attention based BERT model.
(TL;DR, from OpenReview.net)

Paper

(a) Each point represents the Pearson correlation coefficient between effective attention and raw attention as a function of token length. (b) Raw attention vs. (c) effective attention, where each point represents the average (effective) attention of a given head to a token type. (source: figure 1 from the paper)

First author: Gino Brunner

Twitter | LinkedIn | Website

9. Mirror-Generative Neural Machine Translation

Translation approaches known as neural machine translation (NMT) models depend on the availability of a large parallel corpus for each language pair. Here, a new method is proposed for translating in both directions using generative neural machine translation.

Paper

The graphical model of MGNMT. (source: figure 1 from the paper)

First author: Zaixiang Zheng

Twitter | Website

10. FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Here, the authors propose a new algorithm, called FreeLB, which formulates a novel approach to adversarial training of language models.

Paper | Code

The algorithm's pseudo-code. (source: figure 1 from the paper)
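
As a rough picture of what this style of adversarial training does, here is a hedged sketch: perturb the word embeddings, take a few ascent steps on the perturbation to approximately maximize the loss inside a small norm ball, and accumulate the model gradients from each step. The `model` callable, step sizes, and the clamping-based projection are placeholder assumptions, not the exact FreeLB algorithm.

```python
import torch

def freelb_style_step(model, embeddings, labels, ascent_steps=3, adv_lr=0.1, eps=0.1):
    """One training step; `model(embeddings + delta, labels)` is a placeholder that returns a scalar loss."""
    delta = torch.zeros_like(embeddings).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(ascent_steps):
        loss = model(embeddings + delta, labels)
        loss.backward()                            # also accumulates grads in the model parameters
        with torch.no_grad():
            # Ascent step on the perturbation, then (crudely) project back into the eps-ball.
            delta += adv_lr * delta.grad / (delta.grad.norm() + 1e-12)
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    # The caller would then average the accumulated parameter gradients over
    # ascent_steps and call optimizer.step().
```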

First author: Chen Zhu

LinkedIn | GitHub | Website

Summary

The depth and breadth of the ICLR publications are quite inspiring. This post focuses on Natural Language Processing/Understanding, one of the main areas discussed during the conference. According to this analysis, these areas include:

  1. Deep learning (here)
  2. Reinforcement learning (here)
  3. Generative models (here)
  4. Natural Language Processing/Understanding (covered here)

To create a more complete overview of the top papers at ICLR, we have built a series of posts, each focused on one of the topics mentioned above. This is the last one, so you may want to check out the others as well.

We would be happy to extend our list, so feel free to share other interesting NLP/NLU papers with us.

In the meantime — happy reading!

This article was originally written by Kamil and posted on the Neptune blog.