Training a neural network to keep up with the latest ML papers on arXiv
With the rising popularity of arXiv, researchers from around the world are posting preprints as soon as they have new ideas rather than waiting for conferences or peer review. Hundreds of ML-related preprints are published each week, so looking through all of the abstracts to figure out which ones are relevant to your interests, novel, and free of major methodological errors is prohibitively time consuming.
The best way to navigate this never-ending flow of preprints is human curators. I personally follow the Twitter accounts of dozens of people in the ML community, as well as a handful of email newsletters that post recent papers (The Wild Week in AI by Denny Britz and Import AI by Jack Clark are my two favorites). This works well for finding the most prominent preprints of the week, but it doesn’t quite capture the long tail of interesting research. Most people are understandably busy and don’t curate more than a few papers a week.
The one exception is Miles Brundage. Miles is the Michael Jordan of tweeting arXiv preprints. He posts a giant Twitter thread with 20–50 papers several times per week. Since the start of 2017, he’s tweeted almost 6000 arXiv links. That’s over 20 a day. His tweets reliably identify a few interesting-looking papers each week that I wouldn’t have seen otherwise.
After a while I began to wonder: how does Miles select all of these papers? And can his process possibly be automated? Miles answered the first question on Twitter:
I created Brundage Bot to answer the second question.
Collecting the data set
First, I used the Twitter API to download all of Miles’ tweets and parsed out every link to arxiv.org. Then I used the arXiv API to download metadata on every paper from the categories Miles mentioned looking at:
cs.* (Computer Science - All subcategories)
cond-mat.dis-nn (Physics - Disordered Systems and Neural Networks)
q-bio.NC (Quantitative Biology - Neurons and Cognition)
stat.CO (Statistics - Computation)
stat.ML (Statistics - Machine Learning)
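As a rough sketch of the link-parsing step (not the exact code behind Brundage Bot), the arXiv ID can be pulled out of each tweeted URL with a regular expression. The function name and the pattern are assumptions made for illustration: the pattern only handles new-style IDs such as 1710.04759, and it assumes the shortened t.co links have already been expanded to full arxiv.org URLs.

```python
import re

# Assumed pattern: new-style arXiv IDs (YYMM.NNNNN) in abs/ or pdf/ URLs,
# with an optional version suffix such as "v2" that is dropped.
ARXIV_URL = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def extract_arxiv_ids(urls):
    """Return the set of arXiv IDs found in an iterable of expanded URLs."""
    ids = set()
    for url in urls:
        match = ARXIV_URL.search(url)
        if match:
            ids.add(match.group(1))
    return ids


tweeted_ids = extract_arxiv_ids([
    "https://arxiv.org/abs/1710.04759",
    "https://arxiv.org/pdf/1710.04759v1.pdf",  # same paper, PDF link
    "https://example.com/not-a-paper",         # ignored
])
```

Stripping the version suffix means the abstract link and the PDF link for the same paper collapse to a single ID, which is what a join against the arXiv metadata needs.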
The arXiv API returns the title, abstract, and authors for each paper. I joined the two datasets together by arXiv ID in each URL. As of today, the dataset contains 27k papers, ~5800 (21.5%) of which have been tweeted out by Miles. This graph shows how many papers were published on arXiv and tweeted by Miles each day:
Can we predict what Miles will tweet?
Next, I wanted to see if it was possible to predict whether Miles would tweet a paper using the information from the arXiv API (title, abstract, and authors).
I concatenated the title of each paper to its abstract and created tf-idf n-gram features (up to trigrams) from the text. I then concatenated one-hot-encoded vectors representing the paper’s authors and arXiv category. I filtered out n-grams that appeared fewer than 30 times in the training set (out of ~25k total abstracts) and authors who appeared fewer than 3 times. This left around 17k total features.
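A feature pipeline along these lines can be sketched with scikit-learn. This is one plausible reconstruction, not the original code: the paper dictionary keys (title, abstract, authors, category) and the function names are hypothetical, and min_df counts the number of documents containing a term, which only approximates "appeared fewer than N times".

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def _identity(tokens):
    """Pass a pre-split list (authors or categories) straight through."""
    return tokens


def fit_features(papers, min_ngram_docs=30, min_author_docs=3):
    """Build a sparse matrix of tf-idf n-grams plus one-hot authors/categories.

    `papers` is a list of dicts with hypothetical keys: "title", "abstract",
    "authors" (list of names) and "category" (primary arXiv category).
    """
    texts = [p["title"] + " " + p["abstract"] for p in papers]
    # Unigrams through trigrams, dropping n-grams seen in too few documents.
    text_vec = TfidfVectorizer(ngram_range=(1, 3), min_df=min_ngram_docs)
    # binary=True yields 0/1 indicator columns, i.e. a multi-hot encoding.
    author_vec = CountVectorizer(
        analyzer=_identity, binary=True, min_df=min_author_docs
    )
    category_vec = CountVectorizer(analyzer=_identity, binary=True)
    features = hstack([
        text_vec.fit_transform(texts),
        author_vec.fit_transform([p["authors"] for p in papers]),
        category_vec.fit_transform([[p["category"]] for p in papers]),
    ]).tocsr()
    return features, (text_vec, author_vec, category_vec)
```

In practice the vectorizers would be fit on the training split only and then applied to the test split with transform, so that the frequency thresholds never see held-out papers.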
Finally, I held out a randomly selected 10% of the data as a test set and trained a logistic regression using sklearn. I added L1 regularization (with the parameter chosen by cross-validation) and a class-weighted loss to help with the large number of features and the class imbalance.
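The training setup described above might look roughly like the following in scikit-learn. The function name, the number of folds, and the grid of regularization strengths are assumptions; only the 10% hold-out, the L1 penalty, the cross-validated strength, and the class weighting come from the description.

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split


def train_classifier(X, y, seed=0):
    """Hold out 10% of the data and fit an L1-regularized logistic regression."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, random_state=seed
    )
    clf = LogisticRegressionCV(
        penalty="l1",             # sparse solution: most coefficients become 0
        solver="liblinear",       # a solver that supports the L1 penalty
        class_weight="balanced",  # up-weight the rarer "tweeted" class
        Cs=10,                    # assumed grid of 10 regularization strengths
        cv=5,                     # assumed number of cross-validation folds
    )
    clf.fit(X_train, y_train)
    return clf, X_test, y_test
```

The "balanced" setting weights each class inversely to its frequency, so with roughly one tweeted paper in five, a missed tweeted paper costs more than a false alarm on an untweeted one.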
Precision was .71 and recall was .51. In other words, the logistic-regression-based Brundage Bot tweeted 51% of the set of papers that Miles tweeted. And 71% of the papers the bot tweeted were actually tweeted by Miles (if this doesn’t make sense, there is a nice visual explanation of precision and recall on Wikipedia).
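For readers who prefer code to diagrams, here is a small worked example of both metrics. The labels are made up for illustration and are not taken from the Brundage Bot test set.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = tweeted)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Precision: of the papers the bot tweeted, how many did Miles tweet?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the papers Miles tweeted, how many did the bot tweet?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy data: Miles tweeted 4 of 8 papers; the bot tweeted 3, of which 2 match.
miles = [1, 1, 1, 1, 0, 0, 0, 0]
bot = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(miles, bot)  # 2/3 and 2/4
```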
Around 900 features picked up non-zero coefficients (most were 0 because of the L1 regularization). Here are the largest coefficients for the n-gram features:
Most positive coefficients (more likely that Miles will tweet the paper)
16. artificial intelligence
19. neural networks
Most negative coefficients (less likely that Miles will tweet the paper)
16. time series
A lot of the hottest topics in the field show up in the positive coefficients: reinforcement learning, generative adversarial networks, bias/fairness, variational auto-encoders. The negative coefficients seem to be indicators of either non-ML or applied ML papers. However, we can’t read too much into the coefficients alone without knowing how often each n-gram occurred and the correlations between them.
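Ranking features by coefficient is a short post-processing step once the model is fit. The helper below is a hypothetical sketch that takes plain lists of feature names and weights; the weights in the example are invented, not the values the model learned.

```python
def top_coefficients(feature_names, coefficients, k=20):
    """Return the k most positive and k most negative non-zero coefficients."""
    pairs = list(zip(feature_names, coefficients))
    positive = sorted(
        (pair for pair in pairs if pair[1] > 0),
        key=lambda pair: pair[1], reverse=True,
    )
    negative = sorted(
        (pair for pair in pairs if pair[1] < 0),
        key=lambda pair: pair[1],
    )
    return positive[:k], negative[:k]


# Invented weights for illustration only.
names = ["reinforcement learning", "time series", "the", "adversarial"]
weights = [1.5, -2.0, 0.0, 0.7]
most_positive, most_negative = top_coefficients(names, weights, k=2)
```

Features that the L1 penalty zeroed out (like "the" here) land in neither list.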
Neural Network Model
Next, I implemented a word-based convolutional neural network in Keras using the same features. This model creates embeddings for each word, then performs 1-D convolutions and a max pooling operation over them. Convolutional networks are computationally efficient (this took around the same amount of time to train as the logistic regression) and tend to perform well in text classification tasks. I would highly recommend Denny Britz’s blog post on the topic for the details of how these networks work and why they are so effective.
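To make the embed, convolve, and max-pool steps concrete, here is a NumPy sketch of the forward pass for a single document. It is not the Keras model itself: there is no training, biases are omitted, the weights are random, and all sizes and names are illustrative assumptions.

```python
import numpy as np


def conv1d_max_pool(embedded, filters):
    """Slide filters over word windows, apply ReLU, keep each filter's max.

    embedded: (seq_len, embed_dim) word vectors for one document.
    filters:  (width, embed_dim, n_filters) convolution weights.
    Requires seq_len >= width. Returns an array of shape (n_filters,).
    """
    width = filters.shape[0]
    n_windows = embedded.shape[0] - width + 1
    # Stack every run of `width` consecutive word vectors.
    windows = np.stack([embedded[i:i + width] for i in range(n_windows)])
    # Dot each window with each filter: result is (n_windows, n_filters).
    responses = np.tensordot(windows, filters, axes=([1, 2], [0, 1]))
    return np.maximum(responses, 0.0).max(axis=0)  # ReLU, then max over time


def encode(token_ids, embeddings, filter_bank):
    """Look up embeddings and concatenate pooled features across widths."""
    embedded = embeddings[np.asarray(token_ids)]
    return np.concatenate([conv1d_max_pool(embedded, f) for f in filter_bank])


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))  # vocabulary of 1000 words
filter_bank = [rng.normal(size=(w, 64, 8)) for w in (1, 2, 3, 4)]
doc_vector = encode([5, 42, 7, 99, 3, 18], embeddings, filter_bank)
```

Max pooling over time gives a fixed-length vector regardless of abstract length: each filter reports only its strongest match anywhere in the text, which is why this family of models behaves a bit like a learned n-gram detector.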
The data is somewhat small and noisy, so I had a lot of trouble with overfitting. I ended up using 64-dimensional embeddings, filters ranging from 1–4 words in width, a very small 12-dimensional fully-connected layer, dropout, and early stopping after 3–4 epochs. The final model ended up with a precision of .70 and recall of .60 (around the same precision as the logistic regression, but with substantially higher recall).
How could we make the model better?
One of the model’s weaknesses is its inability to accurately judge applied ML papers. In the October 16, 2017 batch of arXiv papers, Brundage Bot neglected several applied ML papers that Miles tweeted: ML applied to aviation safety, predicting which Kickstarter projects would be delivered on time, and modeling attention in human crowds.
The main problem appears to be higher variance in the quality of applied machine learning papers. For example, a paper using deep learning for dermatology could be an unknown student’s class project or a Stanford paper on the cover of Nature, and it is very hard to tell the difference using n-grams from the abstract. On the other hand, papers that deal with the mathematical and theoretical underpinnings of deep learning are both more consistently tweet-worthy and more easily distinguished by n-gram features. For example, the October 16 paper Bayesian Hypernetworks uses the phrase “complex multimodal approximate posterior with correlations between parameters” in its abstract. Both the real Miles and Brundage Bot picked it up.
One partial solution is to add the institutional affiliation of each author to the model. With a few exceptions, each author only appears a few times in the data set, so the author features are too sparse to provide much information. But institutions such as DeepMind, Google, or Stanford probably appear in the data often enough to be significant, so I think adding author affiliation could improve accuracy. However, it may be worth leaving them out to avoid privileging any institution based on how often its papers were tweeted in the past.
Using the PDF or LaTeX in addition to abstract text may also be useful in finding the most tweet-worthy papers. Higher quality papers probably have more diagrams and better typesetting on average.
(thanks to Miles for discussing his paper selection process with me and contributing the insights above!)
More Ambitious Ideas
I’m particularly interested in using machine learning to curate the latest arXiv papers because there don’t seem to be many tools available. Google Scholar alerts can work well, but papers can take 2–3 days to show up after they have been posted. Twitter is a great source, but it’s difficult to track down the accounts that are relevant to each subfield of machine learning (is there any directory of academic researchers who actively tweet, organized by field?).
It may be useful to train models to create streams of papers on specific topics (e.g., all the latest papers on Bayesian neural networks). Researchers could manually select a few keywords or relevant papers to define the feed, and a model could look for new papers that contain similar terms.
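One very simple way to prototype such a feed, offered purely as a sketch of the idea, is to build a term-count profile from the seed keywords or abstracts and rank new papers by cosine similarity to it. Every name below is hypothetical, and a real system would likely want tf-idf weighting or learned embeddings rather than raw counts.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_for_feed(seed_texts, candidates):
    """Sort candidate abstracts by similarity to the combined seed profile."""
    profile = Counter()
    for text in seed_texts:
        profile.update(term_counts(text))
    scored = [(cosine(profile, term_counts(c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_for_feed(
    ["bayesian neural networks",
     "variational inference for bayesian deep learning"],
    ["A bayesian view of neural networks",
     "Compiler optimizations for loop unrolling"],
)
```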
In a perfect world, I’d automatically have a curated digest of papers relevant to my interests. I feel like this is possible for researchers who track their reading lists in software. I keep track of everything I’m reading in a series of Evernote notebooks. If I could create a pipeline that pulls the abstracts from my Evernote notes, I could train a model tuned to my personal preferences. It might also be possible to create extensions for Papers.app or Google Scholar. I’d be interested to hear if this sounds useful or if there are any existing solutions. I’m amaub on Twitter and my email is email@example.com.