

NBC News has publicly released a database of deleted Tweets from their investigation into how Russian Twitter Trolls may have influenced the 2016 US election. You can read about the results of NBC's analysis in their stories here and here, but this post focuses on how you can explore the data on your own, using open source data analysis tools. We'll show how to get started with the data and hopefully inspire you to dig into it yourself.
NBC News has released the data as a Neo4j Database and CSV files that can be used with your favorite data analysis tools. But the easiest way to get started with the data is by using Neo4j Sandbox. Neo4j Sandbox allows you to spin up a private hosted instance of Neo4j pre-populated with interesting datasets.
Once you've launched your Russian Twitter Trolls sandbox instance, you'll have access to Neo4j Browser, the query workbench for Neo4j that allows you to interact with the database.
Cypher, the query language for graphs, is a great way to explore the database and can be used from within Neo4j Browser or by building an application using one of Neo4j's client drivers.
Cypher is built around graph pattern matching: you describe a pattern of nodes and relationships, and the database finds every match in the graph. This allows for answering questions like:
What are the most commonly used hashtags by the Trolls?
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag, COUNT(tw) AS num
ORDER BY num DESC
What Troll accounts have the most followers?
MATCH (u:Troll) WHERE EXISTS(u.followers_count)
RETURN u.screen_name AS screen_name, u.followers_count AS followers
ORDER BY followers DESC LIMIT 50
What tweets contain the word "fraud"?
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)
WHERE tw.text CONTAINS "fraud"
OPTIONAL MATCH p=(tw)-[:HAS_TAG|HAS_LINK|MENTIONS|IN_REPLY_TO]-(a)
RETURN * LIMIT 50
Find inferred relationships: what Trolls are retweeting other Trolls?
MATCH p=
(:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(:Troll)
RETURN p LIMIT 10
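The queries above can also be run from an application through a client driver. The following is a minimal sketch, assuming the official `neo4j` Python driver is installed; the connection URL, username, and password are placeholders (not values from this post), and the helper names `top_hashtags` and `open_driver` are made up for illustration.

```python
# Sketch: run the "most commonly used hashtags" query from Python.
# Assumes the official `neo4j` driver package (pip install neo4j).

TOP_HASHTAGS = """
MATCH (:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag AS tag, count(tw) AS num
ORDER BY num DESC LIMIT $limit
"""


def top_hashtags(session, limit=10):
    """Return (tag, count) pairs for the most-used hashtags."""
    result = session.run(TOP_HASHTAGS, limit=limit)
    return [(record["tag"], record["num"]) for record in result]


def open_driver(uri, user, password):
    """Create a driver; imported lazily so top_hashtags has no hard dependency."""
    from neo4j import GraphDatabase
    return GraphDatabase.driver(uri, auth=(user, password))


def main():
    # Hypothetical connection details: replace with your own sandbox's values.
    driver = open_driver("bolt://localhost:7687", "neo4j", "<password>")
    with driver.session() as session:
        for tag, num in top_hashtags(session):
            print(tag, num)
    driver.close()
```

Passing the limit as a query parameter (`$limit`) rather than formatting it into the string keeps the query text constant across calls.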
Further ideas for querying:
Due to the way the data was collected there are some missing pieces. For example, some of the users are missing profile information and some tweets are missing metadata like number of likes and retweets.
Missing Profile Information
For example, the user "@TEN_GOP" is missing profile information in the database because it wasn't captured:
MATCH (u:Troll) WHERE u.screen_name = "TEN_GOP"
RETURN u.id, u.screen_name, u.description, u.location, u.name
-------------------------------------------------------------
╒════════════╤═══════════════╤═══════════════╤════════════╤════════╕
│"u.id"      │"u.screen_name"│"u.description"│"u.location"│"u.name"│
╞════════════╪═══════════════╪═══════════════╪════════════╪════════╡
│"4224729994"│"TEN_GOP"      │""             │""          │""      │
└────────────┴───────────────┴───────────────┴────────────┴────────┘
We can reconstruct the Twitter profile URL for "@TEN_GOP":
https://twitter.com/TEN_GOP
but because these accounts have been suspended by Twitter, all we see is an account-suspended notice.
We can check web caches, such as Internet Archive, to find cached versions of these pages, which we may then be able to scrape. Internet Archive has an API for checking for cached versions of pages, for example:
http://archive.org/wayback/available?url=http://twitter.com/TEN_GOP
--------------------------------------------------------------------
{ url: "http://twitter.com/TEN_GOP",
archived_snapshots: {
closest: {
status: "200",
available: true,
url: "http://web.archive.org/web/20170818065026/https://twitter.com/TEN_GOP",
timestamp: "20170818065026"
}
}
}
This shows that the profile page for "@TEN_GOP" has been captured by Internet Archive and is available here.
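The availability check above can be scripted so that it runs over many profile URLs. This is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it assumes the response has the shape shown above, with a response lacking a `closest` entry treated as "not archived".

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "http://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability-API request URL for a page."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_snapshot(page_url, timeout=10):
    """Query the API over the network and return a snapshot URL or None."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `find_snapshot("http://twitter.com/TEN_GOP")` performs the same lookup as the request shown above. When checking many accounts, it is polite to pause between requests.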
Missing Tweet Information
We can also reconstruct the tweet URLs for tweets in the database that have missing information to check against caches:
MATCH (u:Troll)-[:POSTED]->(t:Tweet) WHERE t.text = ""
RETURN "https://twitter.com/" + u.screen_name + "/status/" + t.id
AS tweet_url
LIMIT 10
------------------------------------------------------------------
╒═══════════════════════════════════════════════════════════╕
│"tweet_url"                                                │
╞═══════════════════════════════════════════════════════════╡
│"https://twitter.com/SCOTTGOHARD/status/781651098398494720"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/SCOTTGOHARD/status/780602260401299456"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783649582064467968"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783642593137754114"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/756033388423897088"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794918302585909250"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/787416487346708481"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794189517653680132"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/797080157135761409"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/781515670379003904"  │
└───────────────────────────────────────────────────────────┘
Unfurling URLs
Many of the tweets contain URLs that use link shortener services, so it's not clear what pages they are actually sharing:
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_LINK]->(u:URL)
WHERE u.expanded_url CONTAINS "bit.ly"
RETURN u.expanded_url LIMIT 10
------------------------------------------------------------
╒═══════════════════════╕
│"u.expanded_url"       │
╞═══════════════════════╡
│"http://bit.ly/2eeMnZR"│
├───────────────────────┤
│"http://bit.ly/2dCn9qP"│
├───────────────────────┤
│"http://bit.ly/2ctTjGN"│
├───────────────────────┤
│"http://bit.ly/2eAOBnf"│
├───────────────────────┤
│"http://bit.ly/2awlrUs"│
├───────────────────────┤
│"http://bit.ly/2aAtdyN"│
├───────────────────────┤
│"http://bit.ly/29UHsyx"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
└───────────────────────┘
We can use tools like cURL to unfurl the links to find the final destination URLs:
$ curl -Ls -w '%{url_effective}' -o /dev/null http://bit.ly/2eeMnZR
Online tools like unfurlr allow us to accomplish the same thing, but can also inspect page content and spoof user agents.
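To unfurl links in bulk, the same redirect-following can be done from a script. This is a minimal sketch using Python's standard library, which follows HTTP redirects by default; the function name `unfurl` and its `opener` parameter are made up for illustration, and redirects performed by JavaScript or meta refresh tags are not followed.

```python
import urllib.request


def unfurl(url, timeout=10, opener=urllib.request.urlopen):
    """Return the final URL after following HTTP redirects.

    `opener` is injectable only so the function can be exercised offline;
    by default it is urllib.request.urlopen, which follows redirects.
    """
    with opener(url, timeout=timeout) as response:
        # geturl() reports the URL actually retrieved, after any redirects.
        return response.geturl()
```

Calling `unfurl("http://bit.ly/2eeMnZR")` issues a GET request much like the cURL command above. Links whose destinations no longer exist will raise an error, so a real batch job would want to catch `urllib.error.URLError` and record the failure.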
Further ideas for enriching the data:
Graph algorithms are a way to apply analytics to the entire graph to further enhance our understanding of the data. These algorithms fall into three categories:
We can run these algorithms in Neo4j with Cypher using the Neo4j Graph Algorithms procedures. For example, here's how to run PageRank on the Troll retweet graph:
CALL algo.pageRank("MATCH (t:Troll) RETURN id(t) AS id",
"MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll) RETURN id(r2) as source, id(r1) as target", {graph:'cypher'})
For more examples of running graph algorithms on the Russian Troll dataset, see the Neo4j Browser guide for the Russian Twitter Trolls Neo4j Sandbox instance.
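To make concrete what PageRank computes on a retweet graph, here is a small power-iteration sketch in plain Python. It is not the Neo4j implementation, and its scores are normalized to sum to 1, so they will not match the procedure's output numerically. Edges follow the same direction as the projection above: (retweeting account, retweeted account). The account names in the example are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        # Each node passes its rank on in equal shares along its out-edges.
        for source, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical accounts: two trolls retweet a third.
scores = pagerank([("troll_a", "troll_c"), ("troll_b", "troll_c")])
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

An account that is retweeted by many others, or by accounts that are themselves heavily retweeted, ends up with a higher score, which is why PageRank is a reasonable first measure of influence within the Troll network.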
Graph algorithms ideas:
Data visualization is often the best way to make sense of the results of graph algorithms. There are a number of open source tools for visualizing graph data, each with their own pros and cons. Tools such as Gephi, vis.js, and Semiotic are commonly used for building interactive graph visualizations.
For those familiar with data visualization, graph data brings a unique set of challenges. Often the most important features of graph visualization are:
The image above shows the results of running PageRank and community detection algorithms on the Russian Troll retweet graph, visualized using a vis.js wrapper library called neovis.js.
Further Ideas:
Can you imagine other types of inferred networks that can be extracted from the graph? How would you express those graphs using Cypher? Can you build an interactive graph visualization using one of the tools mentioned above to visualize that graph?
Natural language processing (NLP) is the process of making sense of text data. Common NLP tasks include part of speech tagging, entity extraction, word similarity, and sentiment analysis. There are a number of open source tools for performing NLP tasks, such as Stanford's CoreNLP tools, NLTK in Python, and even some tools designed specifically for working with Twitter data, such as CMU's Twitter Part-of-Speech Tagger and a crowd-sourced tool for finding hashtag definitions.
You can even run NLP tasks using Cypher directly in Neo4j using an extension. See this post for how to get started using the GraphAware neo4j-nlp procedures. And for those comfortable with using Python tooling for NLP, this blog post shows how to run entity extraction on Twitter data using Neo4j and Python.
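Before reaching for a full NLP toolkit, a rough first pass over tweet text can be done with regular expressions. The sketch below is a crude stand-in for tokenization and entity extraction, not a substitute for the tools named above; the function names and patterns are assumptions made for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text):
    """Split a tweet into hashtags, mentions, URLs, and remaining lowercase words."""
    urls = URL.findall(text)
    without_urls = URL.sub(" ", text)
    hashtags = [tag.lower() for tag in HASHTAG.findall(without_urls)]
    mentions = [name.lower() for name in MENTION.findall(without_urls)]
    remainder = MENTION.sub(" ", HASHTAG.sub(" ", without_urls))
    words = re.findall(r"[a-z']+", remainder.lower())
    return {"hashtags": hashtags, "mentions": mentions, "urls": urls, "words": words}


def top_words(texts, n=10):
    """Count the most frequent plain words across an iterable of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(extract_entities(text)["words"])
    return counts.most_common(n)
```

The `texts` argument could be the `tw.text` values returned by any of the Cypher queries earlier in the post. Removing common stop words before counting would make the output more informative.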
NLP ideas:
We hope you're excited to explore the data. Share anything interesting you find with us on Twitter "@neo4j".