Six Ways To Explore The Russian Twitter Trolls Database In Neo4j

NBC News has publicly released a database of deleted Tweets from their investigation into how Russian Twitter Trolls may have influenced the 2016 US election. You can read about the results of NBC’s analysis in their stories here and here, but the focus of this post will be on how you can explore the data on your own, using open source data analysis tools. We’ll show how to get started with the data and hopefully inspire you to dig into the data yourself.

1. Neo4j Sandbox And Neo4j Browser

NBC News has released the data as a Neo4j Database and CSV files that can be used with your favorite data analysis tools. But the easiest way to get started with the data is by using Neo4j Sandbox. Neo4j Sandbox allows you to spin up a private hosted instance of Neo4j pre-populated with interesting datasets.

Use Neo4j Sandbox to spin up private hosted Neo4j instances pre-populated with interesting datasets.

Once you’ve launched your Russian Twitter Trolls sandbox instance you’ll have access to Neo4j Browser, the query workbench for Neo4j that will allow you to interact with the database.

Use Neo4j Browser to visually explore the database.

2. Query With Cypher

Cypher, the query language for graphs, is a great way to explore the database. It can be used from within Neo4j Browser or from an application built with one of Neo4j’s client drivers.

Cypher is built around graph pattern matching: you describe the pattern of nodes and relationships you are looking for, and the database finds every match in the graph. This allows for answering questions like:

What are the most commonly used hashtags by the Trolls?

MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag, COUNT(tw) AS num
ORDER BY num DESC

What Troll accounts have the most followers?

MATCH (u:Troll) WHERE EXISTS(u.followers_count)
RETURN u.screen_name AS screen_name, u.followers_count AS followers
ORDER BY followers DESC LIMIT 50

What tweets contain the word “fraud”?

MATCH (t:Troll)-[:POSTED]->(tw:Tweet)
WHERE tw.text CONTAINS "fraud"
OPTIONAL MATCH p=(tw)-[:HAS_TAG|HAS_LINK|MENTIONS|IN_REPLY_TO]-(a)
RETURN * LIMIT 50

Find inferred relationships — what Trolls are retweeting other Trolls?

MATCH p=(:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(:Troll)
RETURN p LIMIT 10
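The same queries can be run from an application through a client driver. Below is a minimal Python sketch, not a definitive implementation. It assumes the official neo4j Python driver (installed with pip install neo4j). The function names, the "tag" and "num" aliases, the $limit parameter, and the connection URI and credentials are all illustrative placeholders. Substitute the Bolt URL and password shown for your own sandbox.

```python
# Sketch: run the "most commonly used hashtags" query through a Neo4j session.
TOP_HASHTAGS_QUERY = """
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag AS tag, COUNT(tw) AS num
ORDER BY num DESC
LIMIT $limit
"""


def top_hashtags(session, limit=10):
    """Return [(tag, count), ...] for the most used hashtags.

    `session` is any object with a run(query, **params) method that yields
    mapping-like records, such as a session from the neo4j driver.
    """
    records = session.run(TOP_HASHTAGS_QUERY, limit=limit)
    return [(record["tag"], record["num"]) for record in records]


def print_top_hashtags(uri, user, password, limit=10):
    """Connect to a database and print the top hashtags (needs the neo4j package)."""
    from neo4j import GraphDatabase  # imported here so the sketch loads without it

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for tag, num in top_hashtags(session, limit):
                print(tag, num)
    finally:
        driver.close()
```

Calling print_top_hashtags("bolt://localhost:7687", "neo4j", "password") would run the query against a local instance. Those three values are placeholders for your own connection details.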

Further ideas for querying:

What are the most commonly used applications by the Trolls to post tweets?
What locations do the Troll accounts list in their profiles?
Which tweets were retweeted most often by accounts other than the Russian Troll accounts?

3. Fill In Missing Data

Due to the way the data was collected there are some missing pieces. For example, some of the users are missing profile information and some tweets are missing metadata like number of likes and retweets.

Missing Profile Information

The user “@TEN_GOP”, for instance, has no profile information in the database because it was not captured:

MATCH (u:Troll) WHERE u.screen_name = "TEN_GOP"
RETURN u.id, u.screen_name, u.description, u.location, u.name
-------------------------------------------------------------
╒════════════╤═══════════════╤═══════════════╤════════════╤════════╕
│"u.id"      │"u.screen_name"│"u.description"│"u.location"│"u.name"│
╞════════════╪═══════════════╪═══════════════╪════════════╪════════╡
│"4224729994"│"TEN_GOP"      │""             │""          │""      │
└────────────┴───────────────┴───────────────┴────────────┴────────┘

We can reconstruct the Twitter profile URL for “@TEN_GOP”:

https://twitter.com/TEN_GOP

but because these accounts have been suspended by Twitter, all we see is:

The Russian Troll accounts were suspended by Twitter, removing their data from Twitter.com and Twitter’s API

We can check web caches, such as Internet Archive, for cached versions of these pages, which we may then be able to scrape. Internet Archive has an API for checking whether a cached version of a page exists, for example:

http://archive.org/wayback/available?url=http://twitter.com/TEN_GOP
--------------------------------------------------------------------
{
  url: "http://twitter.com/TEN_GOP",
  archived_snapshots: {
    closest: {
      status: "200",
      available: true,
      url: "http://web.archive.org/web/20170818065026/https://twitter.com/TEN_GOP",
      timestamp: "20170818065026"
    }
  }
}

shows that the profile page for “@TEN_GOP” has been captured by Internet Archive and is available here.
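To check many accounts this way, the lookup can be scripted. The following is a rough Python sketch using only the standard library. It assumes the response has the shape shown above (an archived_snapshots object with an optional closest entry). The function names are our own and not part of any Internet Archive client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_AVAILABLE = "http://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability-check URL for a page, URL-encoding the target."""
    return WAYBACK_AVAILABLE + "?" + urlencode({"url": page_url})


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url, timeout=10):
    """Query the API over the network and return the snapshot URL, or None."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

find_snapshot("http://twitter.com/TEN_GOP") performs the request. closest_snapshot can be used on its own with responses you have already saved.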

Missing Tweet Information

We can also reconstruct the tweet URLs for tweets in the database that have missing information to check against caches:

MATCH (u:Troll)-[:POSTED]->(t:Tweet) WHERE t.text = ""
RETURN "https://twitter.com/" + u.screen_name + "/status/" + t.id AS tweet_url
LIMIT 10
------------------------------------------------------------------
╒═══════════════════════════════════════════════════════════╕
│"tweet_url"                                                │
╞═══════════════════════════════════════════════════════════╡
│"https://twitter.com/SCOTTGOHARD/status/781651098398494720"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/SCOTTGOHARD/status/780602260401299456"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783649582064467968"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783642593137754114"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/756033388423897088"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794918302585909250"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/787416487346708481"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794189517653680132"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/797080157135761409"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/781515670379003904"  │
└───────────────────────────────────────────────────────────┘

Unfurling URLs

Many of the tweets contain URLs from link-shortening services, so it’s not clear which pages they actually share:

MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_LINK]->(u:URL)
WHERE u.expanded_url CONTAINS "bit.ly"
RETURN u.expanded_url LIMIT 10
------------------------------------------------------------
╒═══════════════════════╕
│"u.expanded_url"       │
╞═══════════════════════╡
│"http://bit.ly/2eeMnZR"│
├───────────────────────┤
│"http://bit.ly/2dCn9qP"│
├───────────────────────┤
│"http://bit.ly/2ctTjGN"│
├───────────────────────┤
│"http://bit.ly/2eAOBnf"│
├───────────────────────┤
│"http://bit.ly/2awlrUs"│
├───────────────────────┤
│"http://bit.ly/2aAtdyN"│
├───────────────────────┤
│"http://bit.ly/29UHsyx"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
└───────────────────────┘

We can use tools like cURL to unfurl the links to find the final destination URLs:

➜ curl -Ls -w %{url_effective} -o /dev/null http://bit.ly/2eeMnZR

http://ksnt.com/2016/10/27/early-voting-more-good-signs-for-clinton-in-key-states/?utm_source=twitterfeed&utm_medium=twitter%

Online tools like unfurlr allow us to accomplish the same thing, but can also inspect page content and spoof user agents.
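If you would rather unfurl links in bulk from a script, the standard library's urlopen follows redirects and reports the final URL, much like the cURL command above. This is only a sketch. The unfurl function, its open_url hook (there to make the function easy to test without network access), and the optional user_agent parameter are our own illustrative names. Some shorteners or destination sites may block scripted requests.

```python
from urllib.request import Request, urlopen


def unfurl(short_url, open_url=urlopen, timeout=10, user_agent=None):
    """Follow redirects starting at short_url and return the final URL."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = Request(short_url, headers=headers)
    # urlopen follows HTTP redirects by default; geturl() is the URL
    # of the resource that was finally retrieved.
    with open_url(request, timeout=timeout) as response:
        return response.geturl()
```

Calling unfurl("http://bit.ly/2eeMnZR") would issue the request. Links whose destination no longer exists will raise an error, which you may want to catch and record.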

Further ideas for enriching the data:

Supplementing the data with Google Knowledge Graph
Checking web caches like archive.is and Internet Archive for cached versions of the deleted tweets
Searching other social media platforms for usernames that have been reused

4. Graph Algorithms

Graph algorithms are a way to apply analytics to the entire graph to further enhance our understanding of the data. These algorithms fall into three categories:

Centrality: What are the most important nodes in the network? Centrality algorithms include PageRank, Betweenness Centrality, and Closeness Centrality.
Community detection: How can the graph be partitioned? Community detection and clustering algorithms include Union Find, Louvain, Label Propagation, and Connected Components.
Pathfinding: What are the shortest paths or best routes available given cost? Pathfinding algorithms include Minimum Weight Spanning Tree, All Pairs and Single Source Shortest Path, and Dijkstra.

PageRank is a recursive graph algorithm that scores a node’s importance in proportion to the number and importance of the nodes linking to it. Image source: Wikipedia

We can run these algorithms in Neo4j with Cypher using the Neo4j Graph Algorithms procedures. For example, here’s how to run PageRank on the Troll retweet graph:

CALL algo.pageRank(
  "MATCH (t:Troll) RETURN id(t) AS id",
  "MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll) RETURN id(r2) as source, id(r1) as target",
  {graph:'cypher'}
)

For more examples of running graph algorithms on the Russian Troll dataset, see the Neo4j Browser guide for the Russian Twitter Trolls Neo4j Sandbox instance.
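To make concrete what PageRank computes on a retweet graph, here is a small pure-Python sketch of the textbook power-iteration version. It is not the Neo4j Graph Algorithms implementation, and its scores will not necessarily match the procedure's output. The damping factor of 0.85 and the fixed iteration count are conventional defaults that we assume here. Edges are (source, target) pairs meaning "source retweeted target".

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) pairs.

    Returns {node: score}, with scores summing to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}

    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes its rank in equal shares along its outgoing links.
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank
```

With hypothetical edges such as [("troll_a", "troll_c"), ("troll_b", "troll_c")], the account that is retweeted by both others ends up with the highest score.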

Graph algorithms ideas:

What are the most influential Troll accounts?
Can you find communities in the graph based on interactions using community detection algorithms?

5. Graph Visualization

Data visualization is often the best way to make sense of the results of graph algorithms. There are a number of open source tools for visualizing graph data, each with their own pros and cons. Tools such as Gephi, vis.js, and Semiotic are commonly used for building interactive graph visualizations.

For those familiar with data visualization, graph data brings a unique set of challenges. Often the most important features of graph visualization are:

Binding node size to the importance, or centrality, of the node in the graph
Grouping the nodes together in clusters. Many graph visualization tools use a force-directed layout to surface clusters; however, we can also use community detection algorithms and bind the communities to node color to show clusters
Showing relationship thickness proportional to a property, or weight, of the relationship

Visualizing the Russian Troll retweet/reply network. Node size is proportional to PageRank, color shows the result of a community detection algorithm, and relationship thickness is determined by the number of retweets between the Trolls.

The image above shows the results of running PageRank and community detection algorithms on the Russian Troll retweet graph, visualized using a vis.js wrapper library called neovis.js.
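The three bindings described above (size, color, and thickness) can be sketched as a small data-preparation step. The Python below is only an illustration and is not how neovis.js works internally. It assumes you have already exported per-account PageRank scores and community ids plus retweet counts. It emits nodes and edges using the field names that, to our understanding, vis.js networks use for scaling and grouping (value and group on nodes, value on edges). The function names and input shapes are hypothetical.

```python
import json


def to_vis_network(accounts, retweets):
    """Build vis.js-style node and edge lists.

    accounts: {screen_name: {"pagerank": float, "community": int}}
    retweets: {(source, target): retweet_count}
    """
    nodes = [
        {
            "id": name,
            "label": name,
            "value": props["pagerank"],   # node size follows centrality
            "group": props["community"],  # node color follows community
        }
        for name, props in sorted(accounts.items())
    ]
    edges = [
        # edge thickness follows the number of retweets
        {"from": source, "to": target, "value": count}
        for (source, target), count in sorted(retweets.items())
    ]
    return {"nodes": nodes, "edges": edges}


def to_vis_json(accounts, retweets):
    """Serialize the network so it can be loaded by a web page."""
    return json.dumps(to_vis_network(accounts, retweets), indent=2)
```

The resulting JSON can be loaded by a page that hands the nodes and edges to the visualization library of your choice.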

Further Ideas:

Can you imagine other types of inferred networks that can be extracted from the graph? How would you express those graphs using Cypher? Can you build an interactive graph visualization using one of the tools mentioned above to visualize that graph?

6. Natural Language Processing

Natural language processing (NLP) is the process of making sense of text data. Common NLP tasks include part-of-speech tagging, entity extraction, word similarity, and sentiment analysis. There are a number of open source tools for performing NLP tasks, such as Stanford’s CoreNLP tools and NLTK in Python, and even some tools designed specifically for working with Twitter data, such as CMU’s Twitter Part-of-Speech Tagger and a crowd-sourced tool for finding hashtag definitions.

Entity extraction on the Tweet data involves extending the graph model by annotating tweets that contain entities.

You can even run NLP tasks using Cypher directly in Neo4j using an extension. See this post for how to get started using the GraphAware neo4j-nlp procedures. For those comfortable with Python tooling for NLP, this blog post shows how to run entity extraction on Twitter data using Neo4j and Python.

NLP ideas:

Entity extraction — what are the most common people, organizations, and places mentioned in the tweets? Are certain groups of trolls talking about certain entities?
Sentiment analysis — are the Trolls talking positively about anything? Or do they focus on spreading negativity only?

We hope you’re excited to explore the data. Share anything interesting you find with us on Twitter at “@neo4j”.