NBC News has publicly released a database of deleted Tweets from their investigation into how Russian Twitter Trolls may have influenced the 2016 US election. You can read about the results of NBC’s analysis in their stories here and here, but the focus of this post will be on how you can explore the data on your own, using open source data analysis tools. We’ll show how to get started with the data and hopefully inspire you to dig into the data yourself.
NBC News has released the data as a Neo4j Database and CSV files that can be used with your favorite data analysis tools. But the easiest way to get started with the data is by using Neo4j Sandbox. Neo4j Sandbox allows you to spin up a private hosted instance of Neo4j pre-populated with interesting datasets.
Once you’ve launched your Russian Twitter Trolls sandbox instance, you’ll have access to Neo4j Browser, the query workbench for Neo4j that lets you interact with the database.
Use Neo4j Browser to visually explore the database.
Cypher, the query language for graphs, is a great way to explore the database. It can be used from within Neo4j Browser or from an application built with one of Neo4j’s client drivers.
Cypher uses graph pattern matching to allow users to express complex graph patterns to match against the graph. This allows for answering questions like:
What are the most commonly used hashtags by the Trolls?
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag, COUNT(tw) AS num
ORDER BY num DESC
What Troll accounts have the most followers?
MATCH (u:Troll) WHERE EXISTS(u.followers_count)
RETURN u.screen_name AS screen_name, u.followers_count AS followers
ORDER BY followers DESC LIMIT 50
What tweets contain the word “fraud”?
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)
WHERE tw.text CONTAINS "fraud"
OPTIONAL MATCH p=(tw)-[:HAS_TAG|HAS_LINK|MENTIONS|IN_REPLY_TO]-(a)
RETURN * LIMIT 50
Find inferred relationships — what Trolls are retweeting other Trolls?
MATCH p=(:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(:Troll)
RETURN p LIMIT 10
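The same queries can also be run from application code through a client driver. The following is a minimal sketch, assuming the official `neo4j` Python driver is installed. The connection URL and credentials in `main` are placeholders (copy the real ones from your sandbox), and the helper name `fetch_top_hashtags` is ours, not part of the dataset or the driver.

```python
# Hashtag query from above, with aliases and a $limit parameter added.
TOP_HASHTAGS = """
MATCH (:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag AS tag, count(tw) AS num
ORDER BY num DESC
LIMIT $limit
"""


def fetch_top_hashtags(session, limit=10):
    """Run the hashtag query on an open driver session.

    Returns a list of (tag, count) pairs, most used first.
    """
    result = session.run(TOP_HASHTAGS, limit=limit)
    return [(record["tag"], record["num"]) for record in result]


def main():
    # Imported here so the helper above loads even without the driver installed.
    from neo4j import GraphDatabase

    # Placeholder connection details (assumption): replace with your sandbox's.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        for tag, num in fetch_top_hashtags(session):
            print(tag, num)
    driver.close()
```

Calling `main()` prints one hashtag and its tweet count per line. Passing the limit as a query parameter, rather than formatting it into the string, lets the database reuse the query plan.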
Further ideas for querying:
Due to the way the data was collected, there are some missing pieces. For example, some of the users are missing profile information, and some tweets are missing metadata such as the number of likes and retweets.
Missing Profile Information
For example, the user “@TEN_GOP” is missing profile information in the database as this wasn’t captured:
MATCH (u:Troll) WHERE u.screen_name = "TEN_GOP"
RETURN u.id, u.screen_name, u.description, u.location, u.name
-------------------------------------------------------------
╒════════════╤═══════════════╤═══════════════╤════════════╤════════╕
│"u.id"      │"u.screen_name"│"u.description"│"u.location"│"u.name"│
╞════════════╪═══════════════╪═══════════════╪════════════╪════════╡
│"4224729994"│"TEN_GOP"      │""             │""          │""      │
└────────────┴───────────────┴───────────────┴────────────┴────────┘
We can reconstruct the Twitter profile URL for “@TEN_GOP”:
https://twitter.com/TEN_GOP
but because these accounts have been suspended by Twitter, all we see is:
The Russian Troll accounts were suspended by Twitter, removing their data from Twitter.com and Twitter’s API
We can check web caches, such as the Internet Archive, to find cached versions of these pages, which we may then be able to scrape. The Internet Archive has an API for checking whether a cached version of a page exists, for example:
http://archive.org/wayback/available?url=http://twitter.com/TEN_GOP
--------------------------------------------------------------------
{
  url: "http://twitter.com/TEN_GOP",
  archived_snapshots: {
    closest: {
      status: "200",
      available: true,
      url: "http://web.archive.org/web/20170818065026/https://twitter.com/TEN_GOP",
      timestamp: "20170818065026"
    }
  }
}
shows that the profile page for “@TEN_GOP” has been captured by Internet Archive and is available here.
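The availability check above can be scripted for many accounts at once. Here is a minimal sketch using only the Python standard library. It assumes the response has the shape shown above, and the function names are illustrative rather than part of any Internet Archive client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "http://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability-check URL for a page, URL-encoding the target."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(payload):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_cached_profile(screen_name):
    """Ask the Internet Archive for a cached copy of a Twitter profile page.

    This performs a network request.
    """
    target = "http://twitter.com/" + screen_name
    with urlopen(availability_url(target), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

For example, `find_cached_profile("TEN_GOP")` would return the snapshot URL if the Internet Archive still reports one, and `None` otherwise.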
Missing Tweet Information
We can also reconstruct the tweet URLs for tweets in the database that have missing information to check against caches:
MATCH (u:Troll)-[:POSTED]->(t:Tweet) WHERE t.text = ""
RETURN "https://twitter.com/" + u.screen_name + "/status/" + t.id AS tweet_url
LIMIT 10
------------------------------------------------------------------
╒═══════════════════════════════════════════════════════════╕
│"tweet_url"                                                │
╞═══════════════════════════════════════════════════════════╡
│"https://twitter.com/SCOTTGOHARD/status/781651098398494720"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/SCOTTGOHARD/status/780602260401299456"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783649582064467968"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783642593137754114"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/756033388423897088"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794918302585909250"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/787416487346708481"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794189517653680132"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/797080157135761409"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/781515670379003904"  │
└───────────────────────────────────────────────────────────┘
Unfurling URLs
Many of the tweets contain URLs that use link shortener services, so it’s not clear which pages they actually share:
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_LINK]->(u:URL)
WHERE u.expanded_url CONTAINS "bit.ly"
RETURN u.expanded_url LIMIT 10
------------------------------------------------------------
╒═══════════════════════╕
│"u.expanded_url"       │
╞═══════════════════════╡
│"http://bit.ly/2eeMnZR"│
├───────────────────────┤
│"http://bit.ly/2dCn9qP"│
├───────────────────────┤
│"http://bit.ly/2ctTjGN"│
├───────────────────────┤
│"http://bit.ly/2eAOBnf"│
├───────────────────────┤
│"http://bit.ly/2awlrUs"│
├───────────────────────┤
│"http://bit.ly/2aAtdyN"│
├───────────────────────┤
│"http://bit.ly/29UHsyx"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
└───────────────────────┘
We can use tools like cURL to unfurl the links to find the final destination URLs:
curl -Ls -w '%{url_effective}' -o /dev/null http://bit.ly/2eeMnZR
Online tools like unfurlr allow us to accomplish the same thing, but can also inspect page content and spoof user agents.
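To unfurl many links in bulk, the same redirect-following step can be scripted. Below is a minimal sketch using Python's standard library, which follows HTTP redirects by default. The function name `unfurl`, its `opener` parameter, and the user-agent string are our own choices for illustration.

```python
from urllib.request import Request, urlopen


def unfurl(short_url, timeout=10, opener=urlopen):
    """Follow HTTP redirects and return the final destination URL.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    # Some shorteners reject requests that have no browser-like user agent.
    request = Request(short_url, headers={"User-Agent": "Mozilla/5.0"})
    with opener(request, timeout=timeout) as response:
        return response.geturl()
```

Calling `unfurl("http://bit.ly/2eeMnZR")` performs a network request and returns whatever URL the redirect chain ends on. Links whose destinations no longer exist will raise an error, which a bulk job should catch and record.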
Further ideas for enriching the data:
Graph algorithms are a way to apply analytics to the entire graph to further enhance our understanding of the data. These algorithms fall into three categories:
PageRank is a recursive graph algorithm that defines the importance of a node as proportional to the number and importance of the nodes that link to it. Image source: Wikipedia.
We can run these algorithms in Neo4j with Cypher using the Neo4j Graph Algorithms procedures. For example, here’s how to run PageRank on the Troll retweet graph:
CALL algo.pageRank(
  "MATCH (t:Troll) RETURN id(t) AS id",
  "MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll) RETURN id(r2) as source, id(r1) as target",
  {graph:'cypher'})
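To build intuition for what the procedure computes, here is a small, self-contained sketch of PageRank by power iteration in Python. It illustrates the general algorithm and is not the Neo4j Graph Algorithms implementation. The damping factor of 0.85 and the iteration count are conventional defaults chosen here as assumptions. Edges point from the retweeting account to the retweeted account, matching the source and target columns in the Cypher projection above.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed graph by power iteration.

    edges: list of (source, target) pairs, e.g. (retweeter, retweeted).
    Returns a dict mapping each node to its score; scores sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

For instance, with `edges = [("a", "c"), ("b", "c"), ("c", "a")]`, account `c` is retweeted by two accounts and ends up with the highest score, while `b`, which nobody retweets, ends up with the lowest.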
For more examples of running graph algorithms on the Russian Troll dataset, see the Neo4j Browser guide for the Russian Twitter Trolls Neo4j Sandbox instance.
Graph algorithms ideas:
Data visualization is often the best way to make sense of the results of graph algorithms. There are a number of open source tools for visualizing graph data, each with their own pros and cons. Tools such as Gephi, vis.js, and Semiotic are commonly used for building interactive graph visualizations.
For those familiar with data visualization, graph data brings a unique set of challenges. Often the most important features of graph visualization are:
Visualizing the Russian Troll retweet/reply network. Node size is proportional to PageRank, color shows the result of a community detection algorithm, and relationship thickness is determined by the number of retweets between the Trolls.
The image above shows the results of running PageRank and community detection algorithms on the Russian Troll retweet graph, visualized with neovis.js, a wrapper library for vis.js.
Further Ideas:
Can you imagine other types of inferred networks that can be extracted from the graph? How would you express those graphs using Cypher? Can you build an interactive graph visualization using one of the tools mentioned above to visualize that graph?
Natural language processing (NLP) is the process of making sense of text data. Common NLP tasks include part-of-speech tagging, entity extraction, word similarity, and sentiment analysis. There are a number of open source tools for performing NLP tasks, such as Stanford’s CoreNLP tools and NLTK in Python, and even some tools designed specifically for working with Twitter data, such as CMU’s Twitter Part-of-Speech Tagger and a crowd-sourced tool for finding hashtag definitions.
Entity extraction on the Tweet data involves extending the graph model by annotating tweets that contain entities.
You can even run NLP tasks using Cypher directly in Neo4j using an extension. See this post for how to get started using the GraphAware neo4j-nlp procedures. And for those comfortable with using Python tooling for NLP, this blog post shows how to run entity extraction on Twitter data using Neo4j and Python.
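To make the annotation idea concrete, here is a deliberately naive Python sketch that pairs tweet IDs with candidate entities. It swaps real entity extraction for a simple regular-expression heuristic (runs of capitalized words), so treat it as a stand-in for the output of a tool such as CoreNLP or NLTK, not as a usable extractor. The function names and the (tweet ID, entity) output shape are assumptions made for illustration.

```python
import re

# Heuristic stand-in for an NLP library: runs of capitalized words,
# e.g. "New York". It misses acronyms and over-matches sentence starts.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def candidate_entities(text):
    """Return candidate entity strings found in a piece of text."""
    return CANDIDATE.findall(text)


def annotate(tweets):
    """Turn (tweet_id, text) pairs into (tweet_id, entity) annotation pairs.

    Each output pair could become a relationship between a tweet node
    and an entity node when extending the graph model.
    """
    return [
        (tweet_id, entity)
        for tweet_id, text in tweets
        for entity in candidate_entities(text)
    ]
```

For example, `annotate([("1", "rally in New York and Ohio today")])` yields one pair for "New York" and one for "Ohio", both attached to tweet "1".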
NLP ideas:
We hope you’re excited to explore the data. Share anything interesting you find with us on Twitter at “@neo4j”.