News spreads out of Twitter faster than anywhere else. It’s amazing how much of the “breaking news” that we see or hear every day starts with a tweet. From tweets which leave us questioning the nature of reality to twitter-politics all the way to crypto-twitter, what’s obvious is that Twitter is where it all starts.
Twitter feeds work great in a certain sense: we see tweets from people we follow and tweets liked by people we follow. But beyond the daily-active accounts, most of us receive our news from link-shares or news media derived from tweets. There has been substantial effort across industries to create meaningful views of Twitter data-streams, combining statistical classification models and other ML techniques to extract unique and interesting information.
At Snip, our writer recommendations engine runs a classification/clustering/topic-modeling pipeline to generate real-time news recommendations for writers. The current implementation extracts meaningful tweets and groups them into contextual clusters on a near real-time basis. It uses Logistic Regression for relevance classification, models topic clusters via NMF, and then applies extractive text summarization based on term weighting.
Through this post, I’ll outline some of the rationale and challenges in designing the system.
The high-level technical architecture consists of an aggregator front-end and a Machine Learning backend. The goal of the system is to provide a user with a personalized Twitter news feed tailored to their interests, with benchmarks and features based on the following metrics:
1. Combine Status and Timeline feeds and rank them by news-worthiness and accuracy.
2. Remove spam/marketing tweets from a topic feed.
3. Combine data from complementary sources and filter opinionated tweets to create an unbiased view of the event.

1. The view should provide a chronological order of events as they happen and combine top and emerging events on the topic.
2. Vectorize tweets against a news corpus to filter out news-tweets.
3. The system should not include inherent bias, and the result display should include discourse based on semantics and interactions alone.

1. News articles should be recent and up-to-date.
2. If a news article has been redacted upstream, remove it from the stream as soon as feasible.
Twitter provides us with three basic query endpoints: search, timeline and stream. Each gives a different view of a topic, and can be used to create interesting features. To create a near-complete view of a topic, including semantically related people and entities, each topic needs a basic structure that captures the maximum amount of data while scaling to hundreds of topics without hitting rate limits.
[
  {
    "topic": topicName,     // Search topics
    "handles": [],          // Search timelines
    "keywords": keywords    // Sample real-time stream
  }
]
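As a rough illustration, the per-topic structure above could be modelled as a small typed record that is fanned out to the three endpoint types. This is only a sketch: `TopicConfig`, `plan_queries` and the sample values are hypothetical names introduced here, and no Twitter API calls are made.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopicConfig:
    """One tracked topic, mirroring the topic/handles/keywords structure."""
    topic: str                                          # term used for search queries
    handles: List[str] = field(default_factory=list)    # accounts whose timelines are read
    keywords: List[str] = field(default_factory=list)   # terms sampled from the real-time stream


def plan_queries(config: TopicConfig) -> Dict[str, List[str]]:
    """Group the values to send to each endpoint type (no network calls here)."""
    return {
        "search": [config.topic],
        "timeline": list(config.handles),
        "stream": list(config.keywords),
    }


# Hypothetical sample values, for illustration only.
config = TopicConfig(topic="bitcoin", handles=["exampleNewsDesk"], keywords=["btc", "bitcoin"])
plan = plan_queries(config)
print(plan)
```

Keeping one record per topic makes it straightforward to batch requests per endpoint, which is where rate limits are applied.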
Parsing and assigning contextual information to tweets, however, presents unique challenges. A simple popularity-based index often does not suffice for calculating the “news-worthiness” of an article. Tweets are by default not news articles, and a deluge of automated accounts skews the results of any system that tries to determine newsworthiness and accuracy from a popularity index and user feedback alone.
At the heart of this challenge lies a classification problem: filtering bot-streams, opinionated tweets and user-posts from meaningful “newsworthy” texts. This prevents the proverbial “garbage-in garbage-out” problem. The classified data can then be processed via clustering and topic-modeling pipelines to create consumable views of the news around a topic. A supervised learning approach works well with carefully chosen feature vectors.
For model training, we collected and tagged a corpus of a few thousand tweets in each category on two metrics: SPAM and Relevance. This set is used to train our models and validate the accuracy of the classifier for each category.
Feature vectors combine text features and post-metric features with user metrics and the time window of the tweet. For text features, we can combine TF-IDF vectors of lemmatized terms with (1,2) n-grams and named entities.
Beyond the text processing, through observation of tweets in a category we can identify the semantics of spam tweets (such as containing terms like giveaway, rewards etc., or a large number of hashtags and mentions). We can build an additional index of spam features and pass it to the classification engine.
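To make the text features concrete, here is a small pure-Python sketch of TF-IDF weights over unigrams and bigrams. It assumes tokens are already lemmatized, leaves out named entities, and uses a plain log(N/df) inverse document frequency, which differs from the smoothed variants in library vectorizers. The function names and sample tweets are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n_max: int = 2) -> List[str]:
    """Return all word n-grams of length 1..n_max, unigrams first."""
    grams: List[str] = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute a sparse TF-IDF vector (as a dict) for each tokenized document."""
    grams_per_doc = [ngrams(d) for d in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        tf = Counter(grams)
        vectors.append(
            {g: (count / len(grams)) * math.log(n_docs / df[g]) for g, count in tf.items()}
        )
    return vectors


# Toy, already-lemmatized tweets (made up for illustration).
docs = [
    ["fed", "raise", "rates"],
    ["fed", "hold", "rates"],
    ["win", "free", "tickets"],
]
vectors = tfidf(docs)
```

Terms shared across many tweets (such as "fed" here) receive lower weights than n-grams unique to one tweet, which is the property the classifier relies on.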
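The spam signals mentioned above (spam terms, hashtag count, mention count) can be sketched as a tiny feature extractor. The term list below contains only the two examples from the text, and `spam_features` is a hypothetical helper name. A real index would be built from the labelled tweets.

```python
import re
from typing import Dict

# Only the two example terms named in the text; extend from labelled data.
SPAM_TERMS = {"giveaway", "rewards"}


def spam_features(text: str) -> Dict[str, int]:
    """Count simple surface signals that tend to indicate spam tweets."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_mentions": len(re.findall(r"@\w+", text)),
        "n_spam_terms": sum(w in SPAM_TERMS for w in words),
    }


features = spam_features("Huge #crypto #giveaway! Follow @acme and @bob for rewards #win")
print(features)
```

These counts can be appended to the text feature vector before it is passed to the classifier.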
Additional features can be constructed via word2vec indexes of the tweets in a given category against a corpus of reliable news articles in the category to determine news-worthiness of the tweets.
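One way to picture this feature: average the word vectors of a tweet, average the word vectors of the trusted news corpus, and take the cosine similarity of the two. The sketch below uses a made-up two-dimensional embedding table in place of a trained word2vec model, so the vectors, names and scores are illustrative assumptions only.

```python
import math
from typing import Dict, List, Sequence

# Toy 2-d "embeddings" standing in for a trained word2vec model.
EMB: Dict[str, List[float]] = {
    "fed": [1.0, 0.0],
    "rates": [0.9, 0.1],
    "inflation": [0.8, 0.2],
    "giveaway": [0.0, 1.0],
    "win": [0.1, 0.9],
}
DIM = 2


def mean_vector(tokens: Sequence[str]) -> List[float]:
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [EMB[t] for t in tokens if t in EMB]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def news_similarity(tweet_tokens: Sequence[str], news_docs: Sequence[Sequence[str]]) -> float:
    """Similarity between a tweet and the centroid of a news corpus."""
    centroid = mean_vector([t for doc in news_docs for t in doc])
    return cosine(mean_vector(tweet_tokens), centroid)


news = [["fed", "rates"], ["inflation", "rates"]]
newsy_score = news_similarity(["fed", "inflation"], news)
spammy_score = news_similarity(["win", "giveaway"], news)
```

The resulting score can be added as a single numeric feature alongside the TF-IDF and spam features.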
We found that creating a general training model for classifying tweets across categories and topics often results in sacrificing accuracy. On the flip side, manually tagging training data for each individual topic yields more accurate predictions but is time-consuming and not very scalable. We built a supervised category-wise classification model by providing a feature set with high correlation with relevance and spam. Observing a sample of tweets in a given topic/category corpus suggests features which may be generalizable or specific to the given category.
The model is built by creating feature vectors from the labeled sample. The labelled corpus (around 1200 tweets) was divided into training and validation sets (0.33 * len(labelled_set)). The feature vectors were created as described in the architecture diagram above. Features include a sparse TF-IDF vector, a Doc2Vec similarity vector against a corpus of news articles on the topic, and other language semantic features.
Once the model is generated and pickled, we can use it to predict accuracy and spam classifications against a sample of tweets aggregated in real-time. We can define an integer constant N or a timestamp T so that we can filter a data frame to query against the last N tweets or those with inserted_at > T. We use Cassandra/Spark based distributed computing clusters to tokenize and label the tweets in near real-time and serve them via an API layer.
Topic modeling can be done based on textual and cosine similarity with timestamp decay functions. The aim of the topic-modeling system is to group topics on semantic measures and not just text alone. For the topic modeling we benchmarked NMF, LDA and LSI algorithms. In an unsupervised setting, the NMF results were the most applicable over a wide range of topics. Aiming to keep the number of topics as relevant as possible, we start with an initial approximation that, in a cleaned data-set, the number of topics = num_of_records / 10. We can further merge topics based on semantic similarity or divergence.
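A minimal sketch of the split, assuming 0.33 is the fraction held out for validation (the text does not say which side of the split the figure refers to). `split_labelled`, its parameters and the synthetic corpus are hypothetical. In practice a library helper would typically do this.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_labelled(
    labelled_set: Sequence[T],
    validation_fraction: float = 0.33,  # assumed meaning of the 0.33 in the text
    seed: int = 42,
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and hold out a fraction for validation."""
    items = list(labelled_set)
    random.Random(seed).shuffle(items)
    n_val = int(validation_fraction * len(items))
    return items[n_val:], items[:n_val]


# Synthetic stand-in for ~1200 labelled tweets: (text, label) pairs.
labelled_set = [(f"tweet {i}", i % 2) for i in range(1200)]
train, validation = split_labelled(labelled_set)
print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible, so validation scores are comparable across feature experiments.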
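The "last N tweets or inserted_at > T" selection can be illustrated without any cluster, using plain Python records. The function name and record layout are assumptions for the sketch. In the pipeline described here, the equivalent filter would run against the distributed data frame.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def recent_tweets(
    tweets: List[Dict],
    n: Optional[int] = None,
    since: Optional[datetime] = None,
) -> List[Dict]:
    """Return tweets newer than `since` and/or only the last `n`, oldest first."""
    ordered = sorted(tweets, key=lambda t: t["inserted_at"])
    if since is not None:
        ordered = [t for t in ordered if t["inserted_at"] > since]
    if n is not None:
        # max(...) keeps n=0 from selecting the whole list via a [-0:] slice.
        ordered = ordered[max(len(ordered) - n, 0):]
    return ordered


# Ten synthetic tweets, one minute apart.
base = datetime(2018, 1, 1, 12, 0)
tweets = [{"id": i, "inserted_at": base + timedelta(minutes=i)} for i in range(10)]

last_three = [t["id"] for t in recent_tweets(tweets, n=3)]
after_t = [t["id"] for t in recent_tweets(tweets, since=base + timedelta(minutes=6))]
```

Either bound keeps each prediction batch small, so the pickled model only scores tweets it has not already seen.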
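from textacy.tm import TopicModel
num_topics = int(len(topic_frame) / 10)  # initial estimate: one topic per ~10 records
model = TopicModel('nmf', n_topics=num_topics)
model.fit(doc_term_matrix)  # doc_term_matrix: vectorizer output (name assumed here)
doc_topic_matrix = model.transform(doc_term_matrix)
# Top 8 terms for each topic; further filter by weights
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=-1, top_n=8, weights=True):
    print(topic_idx, top_terms)
# Top 6 docs in each topic
for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=-1, top_n=6, weights=True):
    print(topic_idx, top_docs)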
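The timestamp decay mentioned for topic grouping is not specified in detail. One plausible form, shown here purely as an assumption, scales the cosine similarity of two tweets by an exponential half-life on the time gap between them. The function name and the one-hour default are hypothetical.

```python
def decayed_similarity(
    similarity: float,
    time_gap_seconds: float,
    half_life_seconds: float = 3600.0,  # hypothetical default: similarity halves per hour apart
) -> float:
    """Scale a similarity score down as the time gap between two tweets grows."""
    decay = 0.5 ** (abs(time_gap_seconds) / half_life_seconds)
    return similarity * decay


same_moment = decayed_similarity(0.8, 0)
one_hour_apart = decayed_similarity(0.8, 3600)
three_hours_apart = decayed_similarity(0.8, 3 * 3600)
```

With a weighting like this, two textually similar tweets posted days apart are less likely to be merged into the same emerging topic than two posted minutes apart.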
The final stage of the pipeline is extractive text-summarization: extracting the sentence in the corpus with the highest weight. Weight is calculated from the frequency of weighted terms and a sentence grammar score.
At the end of the pipeline we are able to classify tweets into relevant/non-relevant “newsworthiness” buckets with a high degree of accuracy, and to cluster the classified tweets into modeled buckets with a reasonable degree of accuracy via a combination of unsupervised topic modeling and entity-extraction algorithms.
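A simplified sketch of this selection step: score each sentence by summing the weights of the terms it contains (so repeated weighted terms count more) and return the top sentence. The grammar score is omitted because its definition is not given, and the term weights and sample text below are made up. In the pipeline they would come from the topic model.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def best_sentence(text: str, term_weights: Dict[str, float]) -> str:
    """Return the sentence whose terms carry the highest total weight."""
    sentences = split_sentences(text)
    if not sentences:
        return ""

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        return sum(term_weights.get(w, 0.0) for w in words)

    return max(sentences, key=score)


# Hypothetical term weights and text, for illustration only.
weights = {"bank": 2.0, "rates": 2.5, "inflation": 3.0, "markets": 1.0}
text = (
    "Markets fell sharply today. "
    "The central bank raised rates after inflation data. "
    "Follow us for more."
)
summary = best_sentence(text, weights)
print(summary)
```

A summed score favours longer sentences, so a production version would likely normalize by length or combine it with the grammar score described above.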
The cost of building a classifier with a high degree of accuracy is that it’s not very generalizable across categories, since the features which determine relevance often differ across categories. We’d ideally like to build a single classifier for classifying tweets across categories, but such a classifier would sacrifice accuracy of predictions for generalizability. Instead we choose to have a few classifiers for category buckets, each of which can classify tweets for topics within its category bucket with a high degree of accuracy.
Further exploratory clustering can be done on the basis of sentiment and time-decay metrics to provide even more granular clusters.
Also a shout-out to the maintainers of the following awesome packages which make text-based ML so much fun to work with: SKLearn, Spacy, Textacy, Gensim.