Conferencing and The Art of 'Paper Blitzing'

There are soooo many academic and industrial papers in the field of machine learning, natural language processing (NLP) and computer systems nowadays. And even in a single conference, it’s overwhelming. In this post, I’ll share one of the ways I do conference paper reading; I like to call it a “paper blitz”.

In a paper blitz session, the main goal is to cover all the papers that I will find “interesting” at a really superficial level. And I really mean ALL. The general idea is to understand the “meta-trends” of the conference submissions and group papers under the same trend, so that it gives a more holistic view of either the types of problems an approach can solve or the variations of an approach to solve a single problem type.

For each paper, the objective is to identify:

- Meta-Trend
  - Problem
  - Approach
- Quality
  - Strength
  - Weakness
- Usefulness
  - Code and data availability
  - Natural extension =)
  - How to apply it to my work/interest

Before we continue, here’s a disclaimer. A paper blitz is not a usual paper reading session nor a deep/shallow dive into the papers. We might also mistakenly criticize papers since we are doing a really superficial read. I’ll reiterate that the goal here is recall (how many papers we can cover in a short time) and not precision (how well we understand the papers). Additionally, I’d recommend bookmarking the papers that deserve a deeper dive as a follow-up to the paper blitz.

Filter the papers [15-20 mins]

The first act of the paper blitz is to filter the conference down to ~50 papers that we want to read in the blitz. This is usually the hardest part and there’s no easy way to do it in this case.
So, we resort to judging the paper by its title. Unsolicited advice to paper authors: make the paper title informative and interesting.

And now, はじめ (let’s begin)…

From https://aclanthology.org/events/acl-2022/, I would normally go through every title one by one and copy and paste the titles that I find interesting into a notepad. There will definitely be an inherent bias towards papers on your pet topics, by famous authors, or simply by NLP friends you’ve yet to contact since the pre-covid days. Don’t fight the bias and just put them into your list, but if your whole list is made up of NLP friends’ papers, you either have too many friends or need to cut down your list and force yourself to go through the anthology again. Iterate the process until your list is made up of ~50 papers.

I would strongly recommend that you resist the urge and not use Ctrl+F, and also not pick up papers only based on a specific topic. The paper blitz should be more like a shopping discovery experience than an e-commerce search experience: think window-shopping by scrolling through the e-commerce app, and if you are at the conference physically while doing a paper blitz, think window-shopping in a brick-and-mortar mall.

Here’s some pseudo-code for the filtering process of the paper blitz.

    def is_numberish(paper_blitz, max_num=50, ish_margin=0.2):
        # True when the list length is within +/- 20% of max_num.
        return max_num * (1 - ish_margin) < len(paper_blitz) < max_num * (1 + ish_margin)

    paper_blitz = []
    # Keep going through the anthology until the list is roughly 50 papers long.
    while not is_numberish(paper_blitz, 50):
        # Start each pass afresh. Use your own select_title func, as desired;
        # it is your (human) judgement and may get stricter or looser between passes.
        paper_blitz = [paper for paper in acl_anthology if select_title(paper)]

Drumrolls… ここに (here it is)… ta-da!! For ACL 2022, here’s my personal paper blitz list. You can see there’s definitely a bias to my pet topics:

- Can Transformer be Too Compositional?
  Analysing Idiom Processing in Neural Machine Translation
- The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study
- Challenges and Strategies in Cross-Cultural NLP
- Cross-Lingual Phrase Retrieval
- Early Stopping Based on Unlabeled Samples in Text Classification
- CLUES: A Benchmark for Learning Classifiers using Natural Language Explanations
- DiBiMT: A Novel Benchmark for Measuring Word Sense Disambiguation Biases in Machine Translation
- Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning
- Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-centric Summarization
- FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
- Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
- Identifying Moments of Change from Longitudinal User Text
- Is Attention Explanation? An Introduction to the Debate
- Measuring Fairness of Text Classifiers via Prediction Sensitivity
- Meta-learning via Language Model In-context Tuning
- Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability
- mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
- QuoteR: A Benchmark of Quote Recommendation for Writing
- Robust Lottery Tickets for Pre-trained Language Models
- Sentence-level Privacy for Document Embeddings
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
- UniTE: Unified Translation Evaluation
- Universal Conditional Masked Language Pre-training for Neural Machine Translation
- What Makes Reading Comprehension Questions Difficult?
- Word Order Does Matter and Shuffled Language Models Know It
- Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings
- Machine Translation for Livonian: Catering to 20 Speakers
- Kronecker Decomposition for GPT Compression
- HYPHEN: Hyperbolic Hawkes Attention For Text Streams
- As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning
- QiuNiu: A Chinese Lyrics Generation System with Passage-Level Input
- Language Diversity: Visible to Humans, Exploitable by Machines
- A Natural Diet: Towards Improving Naturalness of Machine Translation Output
- Automatic Song Translation for Tonal Languages
- Dict-BERT: Enhancing Language Model Pre-training with Dictionary
- ELLE: Efficient Lifelong Pre-training for Emerging Data
- Finding the Dominant Winning Ticket in Pre-Trained Language Models
- Mukayese: Turkish NLP Strikes Back
- Rethinking Document-level Neural Machine Translation
- First the Worst: Finding Better Gender Translations During Beam Search
- Word-level Perturbation Considering Word Length and Compositional Subwords
- Unsupervised Preference-Aware Language Identification
- Are Prompt-based Models Clueless?
- BERT Learns to Teach: Knowledge Distillation with Meta Learning
- bert2BERT: Towards Reusable Pretrained Language Models
- Better Language Model with Hypernym Class Prediction

Categorize the papers [25-30 mins]

The filtered list still looks like a mental overload, but that is the goal of the paper blitz: to cover as many papers as possible. The next step in the process is to categorize the papers. How I usually do it is to first put a category on the first paper and then see if the second paper fits into the same category; if not, I create a second category. Iterate till the end of the list, then repeat the categorization process from the first paper and see if the papers need to be reshuffled across the categories. Recur until desired.
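The first-fit pass described above can also be sketched in runnable form. This is only a minimal sketch, assuming that word overlap between titles (Jaccard similarity) can stand in for the human judgement of "fits into the same category". The helper names, the tiny stopword list and the `threshold` value are all hypothetical choices for illustration, and the second reshuffling pass is left out.

```python
import re

# Hypothetical stopword list; extend it as needed.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "and", "via", "with", "to", "be", "on"}


def tokens(title: str) -> set[str]:
    """Lowercased content words of a title."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def categorize(titles: list[str], threshold: float = 0.15) -> list[list[str]]:
    """First-fit grouping: join the most similar existing category, else open a new one."""
    categories: list[list[str]] = []
    for title in titles:
        best, best_sim = None, 0.0
        for cat in categories:
            # Compare against every word already seen in the category.
            cat_tokens = set().union(*(tokens(t) for t in cat))
            sim = jaccard(tokens(title), cat_tokens)
            if sim > best_sim:
                best, best_sim = cat, sim
        if best is not None and best_sim >= threshold:
            best.append(title)
        else:
            categories.append([title])
    return categories


groups = categorize([
    "Robust Lottery Tickets for Pre-trained Language Models",
    "UniTE: Unified Translation Evaluation",
    "Finding the Dominant Winning Ticket in Pre-Trained Language Models",
])
print(groups)
```

On these three titles, the two lottery-ticket papers land in one group and UniTE gets its own. Title overlap is a crude proxy, so expect to fix the groups by hand, just like in the manual process.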
Here’s another piece of pseudo-code, for the categorization process:

    cat_to_paper = {}  # Categories to papers mapping.

    def categorize(paper):
        max_sim = 0  # Keeps the maximum similarity seen so far.
        paper_category = None
        for cat in cat_to_paper:
            # cosine() and vectorize() are just illustrative proxies for our human brains.
            # cosine() is a similarity function, a proxy for how our brain relates things.
            # vectorize() converts a paper (or category) into an abstract numerical vector.
            similarity = cosine(vectorize(paper), vectorize(cat))
            if similarity > max_sim:
                max_sim = similarity
                paper_category = cat
        # Nothing fits (yet): make up a new category. new_category() is your brain again.
        return paper_category or new_category(paper)

    # There's no fixed satisfaction criterion, you'll have to come up with your own.
    while not is_satisfied(cat_to_paper):
        for paper in paper_blitz:
            cat = categorize(paper)
            # Reshuffle: drop the paper from its old category before filing it under cat.
            for papers in cat_to_paper.values():
                if paper in papers:
                    papers.remove(paper)
            cat_to_paper.setdefault(cat, []).append(paper)

Presto! ほら (look)! Here’s a categorized version of my filtered list from ACL 2022:

Multi-Word Expression / Compositionality
- Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
- The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study
- Word-level Perturbation Considering Word Length and Compositional Subwords

Low-Resource Language / Problems
- Challenges and Strategies in Cross-Cultural NLP
- Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
- Machine Translation for Livonian: Catering to 20 Speakers
- Language Diversity: Visible to Humans, Exploitable by Machines
- ELLE: Efficient Lifelong Pre-training for Emerging Data

Multilinguality / Crosslingual NLP
- Cross-Lingual Phrase Retrieval
- Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability
- mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models

Unsupervised / Semi-Supervised
- Early Stopping Based on Unlabeled Samples in Text Classification
- Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and
  Unlabeled Learning
- Unsupervised Preference-Aware Language Identification

Datasets
- CLUES: A Benchmark for Learning Classifiers using Natural Language Explanations
- DiBiMT: A Novel Benchmark for Measuring WSD Biases in Machine Translation
- FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
- QuoteR: A Benchmark of Quote Recommendation for Writing
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Language Learning / Understanding Language
- Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-centric Summarization
- What Makes Reading Comprehension Questions Difficult?
- Word Order Does Matter and Shuffled Language Models Know It
- Are Prompt-based Models Clueless?

Machine Translation (MT) Tricks / MT Evaluation
- UniTE: Unified Translation Evaluation
- As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations
- A Natural Diet: Towards Improving Naturalness of Machine Translation Output
- First the Worst: Finding Better Gender Translations During Beam Search

Applications
- QiuNiu: A Chinese Lyrics Generation System with Passage-Level Input
- Automatic Song Translation for Tonal Languages
- Mukayese: Turkish NLP Strikes Back

Model Architectures / Optimization
- Is Attention Explanation?
  An Introduction to the Debate
- Meta-learning via Language Model In-context Tuning
- Robust Lottery Tickets for Pre-trained Language Models
- Finding the Dominant Winning Ticket in Pre-Trained Language Models
- Kronecker Decomposition for GPT Compression
- HYPHEN: Hyperbolic Hawkes Attention For Text Streams
- BERT Learns to Teach: Knowledge Distillation with Meta Learning
- bert2BERT: Towards Reusable Pretrained Language Models
- Rethinking Document-level Neural Machine Translation

Misc
- Identifying Moments of Change from Longitudinal User Text [NUT]
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [Bias]
- Sentence-level Privacy for Document Embeddings [Privacy]
- Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings [Semantics]
- [Language Modelling (LM) Tricks]
  - Universal Conditional Masked Language Pre-training for Neural Machine Translation
  - Better Language Model with Hypernym Class Prediction

The Actual Paper Reading!!

Before the actual reading, let’s do some backward time management, because we have:

- Limited time (I’ll usually want to spend no more than 3-4 hours on a single blitz)
- Finite brainpower we can take in a day
- Only so much caffeine

With ~1 hour spent on the filter and categorize process, a 4-hour blitz strictly leaves only 180 mins for 50 papers, which is under 4 mins per paper. So let’s be generous and budget 360 mins of reading for the 50 papers. Thus, we have around 7 mins per paper.

But there must be a better way than to use 7 mins for each paper! No? Yes, there is. We can make use of the paper categories to give us a little more leeway when we blitz through the papers. For example, we have 3 papers in the Multi-Word Expression / Compositionality topic, so we get 21 minutes.
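The budgeting arithmetic above is simple enough to put in a few lines. This is just a sketch: the function names are made up for illustration, the 360 mins and 50 papers come from the budget above, and the `spent` list uses the time checks recorded for the three compositionality papers in this post.

```python
def minutes_per_paper(reading_minutes: int, num_papers: int) -> int:
    """Whole minutes available per paper, rounded down."""
    return reading_minutes // num_papers


def topic_budget(num_papers_in_topic: int, per_paper: int) -> int:
    """Pooled budget for a topic: its papers share one pot of time."""
    return num_papers_in_topic * per_paper


per_paper = minutes_per_paper(reading_minutes=360, num_papers=50)
budget = topic_budget(num_papers_in_topic=3, per_paper=per_paper)

# Minutes spent on each of the three papers in the compositionality topic.
spent = [12, 8, 12]
overrun = sum(spent) - budget

print(per_paper, budget, overrun)  # prints: 7 21 11
```

Pooling the time per topic, rather than enforcing 7 mins on every single paper, is what gives the leeway: a quick second paper can pay for a slow first one.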
Since they are the same topic, the time taken to read the “Related Work” or “Previous Work” sections can be collapsed, since we don’t need to put much effort into reading that section in the second and third papers. Most probably, you can also score bonus time if the approaches the papers use are similar or they use the same “Transformer is All You Need” shiny hammer 🔨.

Let’s start with the “Multi-Words and Compositionality” Topic

First note when opening the PDFs: the same author wrote the first two papers in the topic!

Starting with the “Can Transformer be Too Compositional?” paper, the first thing that you will notice is a common gestalt of an NLP paper from ACL: the existence of a figure or example on the top right, in the second column of the first page.

Legend says this top-right-column figure/example gestalt is commonly attributed to Percy Liang, but I’m not too sure it’s true for every paper he authored or co-authored.

And here’s the “summary” of the meta-trend for the three papers in this topic:

Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation. Verna Dankers, Christopher Lucas and Ivan Titov.

Meta-Trend
- Problem: Is non-compositionality inherent in the transformer’s architecture?
- Approach:
  - Probe the attention weights between the MWEs, across the MWE translations
  - Perform canonical correlation analysis (CCA)
    - Note: I’m not exactly sure what this CCA does but it looks like some technique to test similarity between a sentence with a masked MWE vs a sentence with the actual MWE
  - Train a probing classifier to check if figurativeness can be easily predicted by a classifier
    - Note: Seems like one of those eXplainable AI (XAI) approaches

Quality
- Woot! (Strength)
  - Nice work on the different transformer architectures, poking to find out about a linguistic phenomenon
  - This paper’s finding agrees with results from Zaninello and Birch (2020), who ascertain that encoding an idiom as one word improves translations
- Meh… (Weakness)
  - Analysis was done on translation from English → European languages, with the assumption that idiomatic information “density”/salience is equivalent across the target languages
  - Supposedly the analysis of the results should appear in https://github.com/vernadankers/mt_idioms but the link 404s

Usefulness
- Code and data availability:
  - 1756 English idioms from the Oxford Dictionary of English with 57k occurrences
  - Magpie corpus https://github.com/hslh/magpie-corpus
- Natural extension
  - Do the same on CJK / African languages
  - Try reversing the directionality from X → English

Time check: 12 mins

The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study. Verna Dankers, Elia Bruni and Dieuwke Hupkes.

Meta-Trend
- Problem: How to view compositionality more holistically for MT?
- Approach:
  - [Local compositionality] Compare POS templates of synthetic data, semi-natural vs natural translation data
  - [Global compositionality] Analyse if the words in the full idiom phrase are literally translated or not
    - Note: I think this part of the analysis is manually annotated but I might be mistaken

Quality
- Woot!
  - Identified that simplistic probing on compositionality with a synthetic dataset (e.g. English idiom dictionary → European language translation) is too local to the MWE/idiom
  - Their results indicate that models often do not behave compositionally under the local interpretation, but exhibit behaviour that is too local in other cases. In other words, models have the ability to process phrases both locally and globally but do not always correctly modulate between them.
- Meh…
  - The global analysis was based on a small subset of English idioms → Dutch translations
  - The discussion section is a little hand-wavy and reads more like opinion than analysis, but that’s a good thing: it makes readers think a little about whether the claims are intuitive or not

Usefulness
- Code and data availability:
  - Magpie corpus https://github.com/hslh/magpie-corpus
  - Some templates for future work to replicate that POS pattern idiomatic comparison

Time check: 8 mins

Word-level Perturbation Considering Word Length and Compositional Subwords. Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

Meta-Trend
- Problem: Better regularization through subword perturbation
  - Note: Perturbation is just a big word for “disturbing”
- Approach:
  - We “disturb” the subwords by replacing them with a random word or a random phrase (aka compositional subwords)
  - Instead of choosing a word based on the corpus’ average length, have a smarter way to sample the word that replaces the subword; in the paper they use a Poisson distribution, with the length of the original word that we want to replace as the signal to the distribution
    - Note: Why Poisson? Are we trying to say that a “disturbance” or random word substitution is a “rare event”, the kind of thing we usually estimate using a Poisson distribution?
  - “[Word Replacement] [WR] often samples words unrelated to the target word owing to the uniform distribution QV. To address this problem, we propose Compositional Word Replacement CWR that restricts the source of sampling V to S_xi, which consists of two subsets: Substrings and Overlapped Subwords”
    - Note: The description of the CWR (and Algorithm 1) is kind of terse and I’m going to skip it and somehow remember that they did some comparison of substrings and overlapping subwords and somehow chose a word to replace based on those signals
  - Results tested on sentiment analysis (various datasets) + machine translation (IWSLT)

Quality
- Strength
  - Nice results, the proposed approach gave the model a little boost
  - The collected sentiment analysis dataset is very useful!!
- Weakness
  - Paper should have better examples and a clearer explanation of how the CWR was carried out
    - Note: Maybe cos I’m in a blitz and I really don’t want to dive into the details
  - The motivation to choose Poisson should really be explained; it’s a logical choice if we are assuming that these random perturbations are “rare events”, but there should be more to the story of why the authors chose that distribution
  - Orthogonality of stacking the approaches isn’t well explored

Usefulness
- Code and data availability
  - 100,000 English Twitter tweets
  - 671,052 Chinese Weibo samples
  - 352,554 Japanese Twitter tweets (not sure if this is the right link??)
  - 120,000 English Amazon reviews (not sure how to download??)
  - 525,000 Japanese Rakuten reviews (not sure how to download??)
  - 390,000 Chinese JD.com reviews (not sure how to download??)
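To make the length-aware sampling idea from the approach notes above a bit more concrete, here is a toy example. It is a rough sketch based on my blitz-level reading only, not the authors’ algorithm: it assumes the replacement length is drawn from a Poisson distribution whose mean is the length of the original word, and then picks a random vocabulary word of the closest available length. The vocabulary, the function names and the closest-length fallback are all hypothetical, and the CWR part (substrings and overlapped subwords) is not covered.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed integer using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def replace_word(word: str, vocab: list[str], rng: random.Random) -> str:
    """Swap `word` for a vocab word whose length is near a Poisson draw centred on len(word)."""
    target_len = sample_poisson(len(word), rng)
    # Fall back to the closest available length when no word has exactly target_len characters.
    closest = min(abs(len(v) - target_len) for v in vocab)
    candidates = [v for v in vocab if abs(len(v) - target_len) == closest]
    return rng.choice(candidates)


rng = random.Random(0)
# Toy vocabulary, purely for illustration.
vocab = ["a", "to", "cat", "idiom", "phrase", "subword", "language", "translation"]
print([replace_word("blitz", vocab, rng) for _ in range(5)])
```

The point of the sketch is only that replacements tend to have a similar length to the word they replace, instead of being sampled uniformly from the whole vocabulary.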
Time check: 12 mins

Intermission…

Let’s do a quick round-up on the first topic on compositionality and see how we did in the blitz. We took 12 + 8 mins for the first two papers, which are related, and a little longer (12 mins) for the third paper since it was sorta on a different track; it’s more likely to be classified under LM Tricks rather than MWE and Compositionality, but that’s the result of “judging the paper by its title” =)

Up till now, we’ve spent a total of 32 mins on 3 papers. It’s a little over our 21-minute estimate, but possibly we’ve saved future time in reading the related work for the LM Tricks topic.

I hope the exercise up till this point gives you some tips on how researchers blitz through multiple papers in a short time and how reading more papers grouped in some manner helps save time.

Ready for another topic to blitz through? 行くぞ (let’s go)…

Alright, alright, alright! I’ll spare you the torment =) You definitely don’t want to spend 3-4 hours reading through my summary, and I would suggest you take the time to do your own paper blitz instead. In actual fact, I did sit down for another 5+ hours to complete my blitz, but my “meta-trend” notes are not as clean as the ones presented here. TBH, I did kinda go overtime, but considering I spent around 6 hours in total to read through ~50 papers, I think I deserve to chill out with a cold サッポロビール (Sapporo beer) to end my night.

Summary

This article introduces a mechanism that I personally use to read many papers quickly so that I can catch up with the ever-growing number of papers published in the Natural Language Processing (NLP) field. My personal joy in reading papers is always thinking about what new nuggets of knowledge I can get from just a handful of papers and somehow figuring out how to try these approaches in my work projects. And blitzing through papers has helped me identify these nuggets faster than painfully doing shallow or deep dives into popular papers or a handful of papers.
I hope that the paper blitz can help newcomers bootstrap their knowledge of the field and its “state-of-the-art” trends, and also help seasoned researchers save some time in finding the interesting nuggets in a haystack of accepted publications. Until next time, have fun blitzing through academic papers, がんばれ (good luck)!

P/S: I would not recommend doing this regularly. I usually only do it before the conference starts or as soon as the conference ends, at most 2-3 times a year. I usually only do 1-3 paper blitzes a year: sometimes on an *ACL conference, sometimes purely on WMT and MT workshop papers, and sometimes my filtering set starts with a whole mass of “googled” results.