This article introduces a mechanism that I personally use to read many papers quickly so that I can catch up with the ever-growing number of papers published in the Natural Language Processing (NLP) field.
There are soooo many academic and industrial papers in the fields of machine learning, natural language processing (NLP) and computer systems nowadays. Even within a single conference, the volume is overwhelming. In this post, I’ll share one of the ways I do conference paper reading; I like to call it “paper blitz”.
In a paper blitz session, the main goal is to cover all the papers that I find “interesting”, at a really superficial level. And I really mean ALL. The general idea is to understand the “meta-trends” of the conference submissions and group papers under the same trend, so as to get a more holistic view of either the types of problems an approach can solve or the variations of an approach for solving a single problem type.
For each paper, the objective is to identify:
Meta-Trend
Problem
Approach
Quality
Strength
Weakness
Usefulness
Code and data availability
Natural extension
How to apply it to my work/interests =)
Before we continue, here’s a disclaimer. A paper blitz is not a usual paper reading session, nor a deep/shallow dive into the papers. We might also mistakenly critique papers, since we are doing a really superficial read. I’ll reiterate that the goal here is recall (how many papers we can cover in a short time), not precision (how well we understand each paper). Additionally, I’d recommend bookmarking the papers that deserve a deeper dive as a follow-up to the paper blitz.
Filter the papers [15-20 mins]
The first act of the paper blitz is to filter the anthology down to ~50 papers that we want to read in the blitz. This is usually the hardest part, and there’s no easy way to do it. In this case, we resort to judging a paper by its title. Unsolicited advice to paper authors: make the paper title informative and interesting.
And now, let’s begin…
From https://aclanthology.org/events/acl-2022/, I would normally go through every title one by one and copy-paste the titles that I find interesting into a notepad. There will definitely be an inherent bias towards your pet topics, famous authors or simply NLP friends you’ve yet to contact since the pre-covid days. Don’t fight the bias and just put them into your list. But if your whole list is made up of NLP friends’ papers, either you have too many friends or you should cut down your list and force yourself to go through the anthology again. Iterate the process until your list is made up of ~50 papers.
I would strongly recommend that you resist the urge to use CTRL+F, and that you don’t pick papers based only on a specific topic. The paper blitz should be more of a shopping-discovery experience than an e-commerce search experience: think window-shopping by scrolling through an e-commerce app, or, if you are physically at the conference while doing a paper blitz, think window shopping in a brick-and-mortar mall.
Here’s some pseudo-code for the filtering process of the paper blitz:
def is_numberish(paper_blitz, max_num=50, ish_margin=0.2):
    # True when the list size is within ±20% of max_num, i.e. "50-ish".
    return max_num * (1 - ish_margin) < len(paper_blitz) < max_num * (1 + ish_margin)

paper_blitz = []
# Re-scan the anthology until the list lands at roughly 50 papers.
while not is_numberish(paper_blitz):
    paper_blitz = []
    for paper in acl_anthology:
        if select_title(paper):  # Use your own select_title func, as desired.
            paper_blitz.append(paper)
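In practice, the while loop is just me doing a second or third pass over the anthology page, loosening or tightening my select_title criteria until the list lands near 50.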
Drumroll… here it is… ta-da!!
For ACL 2022, here’s my personal paper blitz list. You can see there’s definitely a bias towards my pet topics:
Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study
Challenges and Strategies in Cross-Cultural NLP
Cross-Lingual Phrase Retrieval
Early Stopping Based on Unlabeled Samples in Text Classification
CLUES: A Benchmark for Learning Classifiers using Natural Language Explanations
DiBiMT: A Novel Benchmark for Measuring Word Sense Disambiguation Biases in Machine Translation
Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning
Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-centric Summarization
FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
Identifying Moments of Change from Longitudinal User Text
Is Attention Explanation? An Introduction to the Debate
Measuring Fairness of Text Classifiers via Prediction Sensitivity
Meta-learning via Language Model In-context Tuning
Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability
mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
QuoteR: A Benchmark of Quote Recommendation for Writing
Robust Lottery Tickets for Pre-trained Language Models
Sentence-level Privacy for Document Embeddings
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
UniTE: Unified Translation Evaluation
Universal Conditional Masked Language Pre-training for Neural Machine Translation
What Makes Reading Comprehension Questions Difficult?
Word Order Does Matter and Shuffled Language Models Know It
Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings
Machine Translation for Livonian: Catering to 20 Speakers
Kronecker Decomposition for GPT Compression
HYPHEN: Hyperbolic Hawkes Attention For Text Streams
As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning
QiuNiu: A Chinese Lyrics Generation System with Passage-Level Input
Language Diversity: Visible to Humans, Exploitable by Machines
A Natural Diet: Towards Improving Naturalness of Machine Translation Output
Automatic Song Translation for Tonal Languages
Dict-BERT: Enhancing Language Model Pre-training with Dictionary
ELLE: Efficient Lifelong Pre-training for Emerging Data
Finding the Dominant Winning Ticket in Pre-Trained Language Models
First the Worst: Finding Better Gender Translations During Beam Search
Word-level Perturbation Considering Word Length and Compositional Subwords
Unsupervised Preference-Aware Language Identification
Are Prompt-based Models Clueless?
BERT Learns to Teach: Knowledge Distillation with Meta Learning
bert2BERT: Towards Reusable Pretrained Language Models
Better Language Model with Hypernym Class Prediction
Categorize the papers [25-30 mins]
The filtered list still looks like a mental overload, but that is the goal of the paper blitz: to cover as many papers as possible. The next step in the process is to categorize the papers. How I usually do it: put a category on the first paper, then see if the second paper fits into the same category; if not, create a second category. Iterate till the end of the list, then repeat the categorization process from the first paper and see if the papers need to be reshuffled across the categories. Recur until satisfied.
Here’s another bit of pseudo-code for the categorization process:
cat_to_paper = {}  # Categories to Papers mapping.

def categorize(paper):
    max_sim = 0  # Variable to keep the maximum similarity seen so far.
    paper_category = None
    for cat in cat_to_paper:
        # cosine() and vectorize() are just illustrative proxies for our human brains:
        # cosine is a similarity function, a proxy for how our brain relates things;
        # vectorize is a function that turns a paper into an abstract numerical vector.
        similarity = cosine(vectorize(paper), vectorize(cat))
        if similarity > max_sim:
            max_sim = similarity
            paper_category = cat
    # If no existing category fits, spawn a new one (another human-brain proxy).
    return paper_category if paper_category else new_category(paper)

# There's no fixed satisfaction criterion; you'll have to come up with your own.
while not is_satisfied(cat_to_paper):
    new_mapping = {}
    for paper in paper_blitz:
        cat = categorize(paper)
        new_mapping.setdefault(cat, []).append(paper)
    cat_to_paper = new_mapping  # Reshuffle and repeat until satisfied.
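If you squint, this is just manual incremental clustering: each paper is assigned to its most similar existing category, and a new category is spawned whenever nothing fits well enough.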
Presto! There you have it!
Here’s a categorized version of my filtered list from ACL 2022:
Multi-Word Expression / Compositionality
Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study
Word-level Perturbation Considering Word Length and Compositional Subwords
Low-Resource Language / Problems
Challenges and Strategies in Cross-Cultural NLP
Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
Machine Translation for Livonian: Catering to 20 Speakers
Language Diversity: Visible to Humans, Exploitable by Machines
ELLE: Efficient Lifelong Pre-training for Emerging Data
Multilinguality / Crosslingual NLP
Cross-Lingual Phrase Retrieval
Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability
mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
Unsupervised / Semi-Supervised
Early Stopping Based on Unlabeled Samples in Text Classification
Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning
Unsupervised Preference-Aware Language Identification
Datasets
CLUES: A Benchmark for Learning Classifiers using Natural Language Explanations
DiBiMT: A Novel Benchmark for Measuring WSD Biases in Machine Translation
FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
QuoteR: A Benchmark of Quote Recommendation for Writing
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Language Learning / Understanding Language
Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-centric Summarization
What Makes Reading Comprehension Questions Difficult?
Word Order Does Matter and Shuffled Language Models Know It
Are Prompt-based Models Clueless?
Machine Translation (MT) Tricks / MT Evaluation
UniTE: Unified Translation Evaluation
As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning
A Natural Diet: Towards Improving Naturalness of Machine Translation Output
First the Worst: Finding Better Gender Translations During Beam Search
Applications
QiuNiu: A Chinese Lyrics Generation System with Passage-Level Input
Automatic Song Translation for Tonal Languages
Mukayese: Turkish NLP Strikes Back
Model Architectures / Optimization
Is Attention Explanation? An Introduction to the Debate
Meta-learning via Language Model In-context Tuning
Robust Lottery Tickets for Pre-trained Language Models
Finding the Dominant Winning Ticket in Pre-Trained Language Models
Kronecker Decomposition for GPT Compression
HYPHEN: Hyperbolic Hawkes Attention For Text Streams
BERT Learns to Teach: Knowledge Distillation with Meta Learning
bert2BERT: Towards Reusable Pretrained Language Models
[NUT] Identifying Moments of Change from Longitudinal User Text
[Bias] Measuring Fairness of Text Classifiers via Prediction Sensitivity
[Privacy] Sentence-level Privacy for Document Embeddings
[Semantics] Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings
Language Modelling (LM) Tricks
Universal Conditional Masked Language Pre-training for Neural Machine Translation
Better Language Model with Hypernym Class Prediction
The Actual Paper Reading!!
Before the actual reading, let’s do some backward time management, because we have:
Limited time (I’ll usually want to spend no more than 3-4 hours on a single blitz)
Finite brainpower
Only so much caffeine we can take in a day
Given a 4-hour blitz with ~1 hour spent on the filter and categorize process, we get 180 mins left for 50 papers, which is barely 3.5 mins per paper. In practice I budget closer to 7 mins per paper, which is part of why a blitz tends to run overtime (more on that at the end of this post).
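For the arithmetic-inclined, here is the same budget as a tiny throwaway script (the numbers are just this post’s assumptions):

def per_paper_mins(total_hours, overhead_mins=60, n_papers=50):
    # Minutes left per paper after filtering and categorizing.
    return (total_hours * 60 - overhead_mins) / n_papers

print(per_paper_mins(4))  # 3.6 -- very tight for a 4-hour blitz
print(per_paper_mins(7))  # 7.2 -- closer to the ~7 mins/paper I actually spend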
But there must be a better way than to use 7 mins for each paper! No?
Yes, there is. We can make use of the paper categories to give us a little more leeway when we blitz through the papers. For example, we have 3 papers in the Multi-Word Expression / Compositionality topic, so we get 21 minutes for the whole group.
Since they share a topic, the time taken to read the “Related Work” or “Previous Work” sections can be collapsed: we don’t need to put much effort into that section in the second and third papers.
Most probably, you can also score bonus time if the approaches the papers use are similar, or if they all swing the same “Transformer is All You Need” shiny hammer 🔨.
Let’s start with the “Multi-Word Expression / Compositionality” topic
First thing to note when opening the PDFs: the same first author wrote the first two papers in the topic!
Starting with the “Can Transformer be Too Compositional?” paper, the first thing you will notice is a common gestalt of an ACL NLP paper: a figure or example at the top right of the second column on the first page.
Legend says this top-right figure/example gestalt is commonly attributed to Percy Liang, but I’m not too sure it’s true for every paper he authored or co-authored.
And here’s the “summary” of the meta-trend for the three papers in this topic:
Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation.
Verna Dankers, Christopher Lucas and Ivan Titov.
Meta-Trend
Problem: Is non-compositionality inherent in the Transformer architecture?
Approach:
Probe the attention weights between the MWEs, across the MWE translations
Perform canonical correlation analysis (CCA)
Note: I’m not exactly sure what this CCA does, but it looks like a technique to test the similarity between a sentence with a masked MWE vs the sentence with the actual MWE (see the toy sketch after this list)
Train a probing classifier to test whether figurativeness can be easily predicted
Note: Seems like one of those eXplainable AI (XAI) approaches
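Since I glossed over CCA above, here’s a minimal toy sketch (my own illustration, not the authors’ code; the shapes and variable names are all made up) of using CCA to measure how similar two sets of sentence representations are, e.g. encoder states for sentences containing the actual MWE vs the same sentences with the MWE masked:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Toy stand-ins for encoder hidden states: 200 sentences x 64 dims each.
# Real ones would come from the NMT encoder; here the "masked" set is a noisy copy.
states_actual = rng.normal(size=(200, 64))
states_masked = states_actual + rng.normal(scale=0.1, size=(200, 64))

cca = CCA(n_components=8)
cca.fit(states_actual, states_masked)
proj_a, proj_m = cca.transform(states_actual, states_masked)

# Mean correlation across the canonical components: near 1.0 means the masked and
# actual-MWE representations carry largely the same information.
corrs = [np.corrcoef(proj_a[:, i], proj_m[:, i])[0, 1] for i in range(8)]
print(f"mean canonical correlation: {np.mean(corrs):.3f}")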
Quality
Woot! (Strength)
Nice work across the different transformer architectures, poking at them to find out about a linguistic phenomenon
This paper’s “finding agrees with results from Zaninello and Birch (2020), who ascertain that encoding an idiom as one word improves translations”
Meh… (Weakness)
Analysis was done on translation from English → European languages, under the assumption that idiomatic information “density”/salience is equivalent across the target languages
The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study.
Verna Dankers, Elia Bruni and Dieuwke Hupkes.
Meta-Trend
Problem: How to view compositionality more holistically for MT?
Approach:
[Local compositionality] Compare POS templates of synthetic data, semi-natural vs natural translation data
[Global compositionality] Analyse if the words in the full idiom phrase are literally translated or not
Note: I think this part of the analysis is manually annotated but I might be mistaken
Quality
Woot!
Identified that simplistic probing of compositionality with a synthetic dataset (e.g. English idiom dictionary → European language translation) is too local to the MWE/idiom
Their “results indicate that models often do not behave compositionally under the local interpretation, but exhibit behaviour that is too local in other cases. In other words, models have the ability to process phrases both locally and globally but do not always correctly modulate between them.”
Meh…
The global analysis was based on a small subset of English idioms → Dutch translations
The discussion section is a little hand-wavy, more like opinions than analysis, but that’s a good thing: it makes readers think a little about whether the claims are intuitive or not.
Word-level Perturbation Considering Word Length and Compositional Subwords.
Meta-Trend
Problem: Better regularization through subword perturbation
Note: Perturbation is just a big word for “disturbing”
Approach:
We “disturb” the subwords by replacing them with a random word or a random phrase (aka compositional subwords)
Instead of choosing a word based on the corpus’s average length, sample the replacement more cleverly: the paper uses a Poisson distribution, with the length of the original word we want to replace as the signal to the distribution (a toy sketch follows this list)
Note: Why Poisson? Are we trying to say that a “disturbance”, or random word substitution, is a “rare event” of the sort we usually estimate with a Poisson distribution?
“[Word Replacement] WR often samples words unrelated to the target word owing to the uniform distribution Q_V. To address this problem, we propose [Compositional Word Replacement] CWR that restricts the source of sampling V to S_xi, which consists of two subsets: Substrings and Overlapped Subwords”
Note: The description of CWR (and Algorithm 1) is kind of terse, so I’m going to skip it and just remember that they compared substrings and overlapping subwords and somehow chose a replacement word based on those signals
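To make the Poisson bit concrete, here’s a toy sketch (my guess at the mechanics, not the authors’ implementation; the vocabulary and helper are invented for illustration) of sampling a replacement word whose length is drawn from a Poisson distribution centred on the original word’s length:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "banana", "hippopotamus", "to", "running", "compositional", "perturb"]

def sample_replacement(original_word):
    # Draw a target length around len(original_word); lam is the Poisson mean.
    target_len = max(1, rng.poisson(lam=len(original_word)))
    # Keep the vocab items whose length is closest to the sampled target...
    best = min(abs(len(w) - target_len) for w in vocab)
    pool = [w for w in vocab if abs(len(w) - target_len) == best]
    # ...and pick one of them uniformly at random.
    return rng.choice(pool)

print(sample_replacement("transformer"))  # tends to pick a similarly long word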
Quality
Woot! (Strength)
Nice results: the proposed approach gave the model a little boost
The collected sentiment analysis datasets are very useful!!
Meh… (Weakness)
The paper should have better examples and a clearer explanation of how the CWR was carried out
Note: Maybe it’s because I’m in a blitz and I really don’t want to dive into the details
The motivation for choosing Poisson should really be explained; it’s a logical choice if we assume these random perturbations are “rare events”, but there should be more to the story of why the authors chose that distribution
Orthogonality of stacking the approaches isn’t well explored
Usefulness
Code and data availability
100,000 English Twitter tweets <https://www.kaggle.com/c/twitter-sentiment-analysis>
671,052 Chinese Weibo samples <https://github.com/wansho/senti-weibo>
352,554 Japanese Twitter tweets <http://www.db.info.gifu-u.ac.jp/data/Data_5d832973308d57446583ed9f> (not sure if this is the right link??)
120,000 English Amazon reviews (not sure how to download??)
525,000 Japanese Rakuten reviews (not sure how to download??)
390,000 Chinese JD.com reviews (not sure how to download??)
Time Check: 12 mins
Intermission…
Let’s do a quick round-up of the first topic on compositionality and see how we did in the blitz. We took 12 + 8 mins for the first two papers, which are closely related, and a little longer (12 mins) for the third paper, since it was sort of in a different track; it’s more likely to be classified under LM Tricks than MWE and Compositionality, but that’s the result of “judging the paper by its title” =)
Up till now, we’ve spent a total of 32 mins on 3 papers. That’s a little over our 21-minute estimate, but possibly we’ve saved future time on reading the related works for the LM Tricks topic. I hope the exercise up to this point gives you some tips on how researchers blitz through multiple papers in a short time, and how reading papers grouped in some manner helps save time.
Ready for another topic to blitz through? Let’s go…
Alright, alright, alright! I’ll spare you the torment =)
You definitely don’t want to spend 3-4 hours reading through my summary, and I would suggest you take the time to do your own paper blitz too. In actual fact, I did sit down for another 5+ hours to complete my blitz, though my “meta-trend” notes are not as clean as the ones presented here. TBH, I did kinda go overtime, but considering I spent around 6 hours in total to read through ~50 papers, I think I deserve to chill out with a cold Sapporo beer to end my night.
Summary
This article introduces a mechanism that I personally use to read many papers quickly so that I can catch up with the ever-growing number of papers published in the Natural Language Processing (NLP) field.
My personal joy in reading papers is always thinking about what new nuggets of knowledge I can get from just a handful of papers, and somehow figuring out how to try these approaches in my work projects. Blitzing through papers has helped me identify these nuggets faster than painfully doing shallow or deep dives into popular papers or a handful of papers.
I hope that the paper blitz can help newcomers bootstrap knowledge of the field and its “state-of-the-art” trends, and also help seasoned researchers save some time in finding the interesting nuggets in a haystack of accepted publications.
Until next time, have fun blitzing through academic papers. Good luck!
P/S: I would not recommend doing this regularly. I usually only do a paper blitz right before a conference starts or right after it ends, at most 2-3 times a year: sometimes for a *ACL conference, sometimes purely for WMT and MT workshop papers, and sometimes my filtering set starts with a whole mass of “googled” results.