NBC News has publicly released a database of deleted Tweets from their investigation into how Russian Twitter Trolls may have influenced the 2016 US election. You can read about the results of NBC’s analysis in their stories, but the focus of this post will be on how you can explore the data on your own, using open source data analysis tools. We’ll show how to get started with the data and hopefully inspire you to dig into the data yourself.

1. Neo4j Sandbox And Neo4j Browser

NBC News has released the data as a Neo4j Database and CSV files that can be used with your favorite data analysis tools. But the easiest way to get started with the data is by using Neo4j Sandbox. Neo4j Sandbox allows you to spin up a private hosted instance of Neo4j pre-populated with interesting datasets.

Once you’ve launched your Russian Twitter Trolls sandbox instance you’ll have access to Neo4j Browser, the query workbench for Neo4j that will allow you to interact with the database. Use Neo4j Browser to visually explore the database.

2. Query With Cypher

Cypher, the query language for graphs, is a great way to explore the database and can be used from within Neo4j Browser or by building an application using one of Neo4j’s client drivers. Cypher uses graph pattern matching to allow users to express complex graph patterns to match against the graph. This allows for answering questions like:

What are the most commonly used hashtags by the Trolls?

    MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
    RETURN ht.tag, COUNT(tw) AS num
    ORDER BY num DESC

What Troll accounts have the most followers?

    MATCH (u:Troll) WHERE EXISTS(u.followers_count)
    RETURN u.screen_name AS screen_name, u.followers_count AS followers
    ORDER BY followers DESC LIMIT 50

What tweets contain the word “fraud”?

    MATCH (t:Troll)-[:POSTED]->(tw:Tweet)
    WHERE tw.text CONTAINS "fraud"
    OPTIONAL MATCH p=(tw)-[:HAS_TAG|HAS_LINK|MENTIONS|IN_REPLY_TO]-(a)
    RETURN * LIMIT 50

Find inferred relationships: what Trolls are retweeting other Trolls?

    MATCH p=(:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(:Troll)
    RETURN p LIMIT 10

Further ideas for querying:

- What are the most commonly used applications by the Trolls to post tweets?
- What locations do the Troll accounts list in their profiles?
- What tweets had the most retweets that were not from other Russian Troll accounts?

3. Fill In Missing Data

Due to the way the data was collected there are some missing pieces. For example, some of the users are missing profile information and some tweets are missing metadata like number of likes and retweets.

Missing Profile Information

For example, the user “@TEN_GOP” is missing profile information in the database as this wasn’t captured:

    MATCH (u:Troll) WHERE u.screen_name = "TEN_GOP"
    RETURN u.id, u.screen_name, u.description, u.location, u.name

    ╒════════════╤═══════════════╤═══════════════╤════════════╤════════╕
    │"u.id"      │"u.screen_name"│"u.description"│"u.location"│"u.name"│
    ╞════════════╪═══════════════╪═══════════════╪════════════╪════════╡
    │"4224729994"│"TEN_GOP"      │""             │""          │""      │
    └────────────┴───────────────┴───────────────┴────────────┴────────┘

We can reconstruct the Twitter profile URL for “@TEN_GOP”:

    https://twitter.com/TEN_GOP

but because the Russian Troll accounts have been suspended by Twitter, their data has been removed from Twitter.com and Twitter’s API.

We can check web caches, such as Internet Archive, to find cached versions of these pages, which we may then be able to scrape.
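The cache check can also be scripted. Below is a minimal Python sketch of one way to do it, not part of the NBC News or Neo4j tooling. It assumes the Wayback Machine availability endpoint at http://archive.org/wayback/available and a JSON response that nests the closest capture under "archived_snapshots" and "closest". The helper names (profile_url, availability_url, closest_snapshot, lookup_snapshot) are hypothetical and chosen only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Wayback Machine availability check.
WAYBACK_AVAILABLE = "http://archive.org/wayback/available"


def profile_url(screen_name):
    """Rebuild the Twitter profile URL from a Troll's screen_name."""
    return "https://twitter.com/" + screen_name


def availability_url(page_url):
    """Build the availability-check request URL for a page."""
    return WAYBACK_AVAILABLE + "?" + urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the closest archived snapshot URL from a JSON response.

    Returns None when the response lists no available snapshot.
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup_snapshot(page_url, timeout=10):
    """Query the endpoint over the network and return the snapshot URL."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup_snapshot(profile_url("TEN_GOP")) performs the network request. The parsing step is kept separate in closest_snapshot so it can be checked against a saved response without network access.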
Internet Archive has an API for checking for cached versions of pages, for example:

    http://archive.org/wayback/available?url=http://twitter.com/TEN_GOP

    {
      "url": "http://twitter.com/TEN_GOP",
      "archived_snapshots": {
        "closest": {
          "status": "200",
          "available": true,
          "url": "http://web.archive.org/web/20170818065026/https://twitter.com/TEN_GOP",
          "timestamp": "20170818065026"
        }
      }
    }

This shows that the profile page for “@TEN_GOP” has been captured by Internet Archive and is available at the snapshot URL in the response.

Missing Tweet Information

We can also reconstruct the tweet URLs for tweets in the database that have missing information to check against caches:

    MATCH (u:Troll)-[:POSTED]->(t:Tweet) WHERE t.text = ""
    RETURN "https://twitter.com/" + u.screen_name + "/status/" + t.id AS tweet_url
    LIMIT 10

    ╒═══════════════════════════════════════════════════════════╕
    │"tweet_url"                                                │
    ╞═══════════════════════════════════════════════════════════╡
    │"https://twitter.com/SCOTTGOHARD/status/781651098398494720"│
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/SCOTTGOHARD/status/780602260401299456"│
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/783649582064467968"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/783642593137754114"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/756033388423897088"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/794918302585909250"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/787416487346708481"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/794189517653680132"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/797080157135761409"  │
    ├───────────────────────────────────────────────────────────┤
    │"https://twitter.com/WarfareWW/status/781515670379003904"  │
    └───────────────────────────────────────────────────────────┘

Unfurling URLs

Many of the tweets contain URLs that use link shortener services, so it’s not clear what pages they are actually sharing:

    MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_LINK]->(u:URL)
    WHERE u.expanded_url CONTAINS "bit.ly"
    RETURN u.expanded_url LIMIT 10

    ╒═══════════════════════╕
    │"u.expanded_url"       │
    ╞═══════════════════════╡
    │"http://bit.ly/2eeMnZR"│
    ├───────────────────────┤
    │"http://bit.ly/2dCn9qP"│
    ├───────────────────────┤
    │"http://bit.ly/2ctTjGN"│
    ├───────────────────────┤
    │"http://bit.ly/2eAOBnf"│
    ├───────────────────────┤
    │"http://bit.ly/2awlrUs"│
    ├───────────────────────┤
    │"http://bit.ly/2aAtdyN"│
    ├───────────────────────┤
    │"http://bit.ly/29UHsyx"│
    ├───────────────────────┤
    │"http://bit.ly/2cOskmM"│
    ├───────────────────────┤
    │"http://bit.ly/2cOskmM"│
    ├───────────────────────┤
    │"http://bit.ly/2cOskmM"│
    └───────────────────────┘

We can use tools like cURL to unfurl the links to find the final destination URLs:

    $ curl -Ls -w '%{url_effective}' -o /dev/null http://bit.ly/2eeMnZR
    http://ksnt.com/2016/10/27/early-voting-more-good-signs-for-clinton-in-key-states/?utm_source=twitterfeed&utm_medium=twitter

Online tools like unfurlr allow us to accomplish the same thing, but can also inspect page content and spoof user agents.

Further ideas for enriching the data:

- Supplementing the data with Google Knowledge Graph
- Checking web caches like archive.is and Internet Archive for cached versions of the deleted tweets
- Searching other social media platforms for usernames that have been reused

4. Graph Algorithms

Graph algorithms are a way to apply analytics to the entire graph to further enhance our understanding of the data.
These algorithms fall into three categories:

- Centrality: What are the most important nodes in the network? Centrality algorithms include PageRank, Betweenness Centrality, and Closeness Centrality.
- Community detection: How can the graph be partitioned? Community detection and clustering algorithms include Union Find, Louvain, Label Propagation, and Connected Components.
- Pathfinding: What are the shortest paths or best routes available given cost? Pathfinding algorithms include Minimum Weight Spanning Tree, All Pairs- and Single Source- Shortest Path, and Dijkstra.

PageRank is a recursive graph algorithm that defines the importance of a node proportional to the importance and number of connected nodes in the graph.

We can run these algorithms in Neo4j with Cypher using the Neo4j Graph Algorithms procedures. For example, here’s how to run PageRank on the Troll retweet graph:

    CALL algo.pageRank(
      "MATCH (t:Troll) RETURN id(t) AS id",
      "MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll) RETURN id(r2) as source, id(r1) as target",
      {graph:'cypher'})

For more examples of running graph algorithms on the Russian Troll dataset, see the Neo4j Browser guide for the Russian Twitter Trolls Neo4j Sandbox instance.

Graph algorithms ideas:

- What are the most influential Troll accounts?
- Can you find communities in the graph based on interactions using community detection algorithms?

5. Graph Visualization

Data visualization is often the best way to make sense of the results of graph algorithms. There are a number of open source tools for visualizing graph data, each with their own pros and cons. Tools such as Gephi, vis.js, and Semiotic are commonly used for building interactive graph visualizations.

For those familiar with data visualization, graph data brings a unique set of challenges.
Often the most important features of graph visualization are:

- Binding node size to the importance, or centrality, of the node in the graph
- Grouping the nodes together in clusters. Many graph visualization tools use a force directed layout to surface clusters; however, we can also use community detection algorithms and bind the communities to node color to show clusters
- Showing relationship thickness proportional to a property, or weight, of the relationship

[Figure: Visualizing the Russian Troll retweet/reply network. Node size is proportional to PageRank, color shows the result of a community detection algorithm, and relationship thickness is determined by the number of retweets between the Trolls.]

The figure shows the results of running PageRank and community detection algorithms on the Russian Troll retweet graph, visualized using a vis.js wrapper library called neovis.js.

Further Ideas:

- Can you imagine other types of inferred networks that can be extracted from the graph? How would you express those graphs using Cypher?
- Can you build an interactive graph visualization using one of the tools mentioned above to visualize that graph?

6. Natural Language Processing

Natural language processing (NLP) is the process of making sense of text data. Common NLP tasks include part of speech tagging, entity extraction, word similarity, and sentiment analysis. There are a number of open source tools for performing NLP tasks such as Stanford’s CoreNLP tools and NLTK in Python, and even some tools designed specifically for working with Twitter data such as CMU’s Twitter Part-of-Speech Tagger and a crowd-sourced tool for finding hashtag definitions.

Entity extraction on the Tweet data involves extending the graph model by annotating tweets that contain entities.

You can even run NLP tasks using Cypher directly in Neo4j using an extension. See this post for how to get started using the GraphAware neo4j-nlp procedures. And for those comfortable with using Python tooling for NLP, this blog post shows how to run entity extraction on Twitter data using Neo4j and Python.

NLP ideas:

- Entity extraction: what are the most common people, organizations, and places mentioned in the tweets? Are certain groups of trolls talking about certain entities?
- Sentiment analysis: are the Trolls talking positively about anything? Or do they focus on spreading negativity only?

We hope you’re excited to explore the data. Share anything interesting you find with us on Twitter at “@neo4j”.