Particularly over the last several years, researchers across a spectrum of scientific disciplines have studied the dynamics of social media networks to understand how information propagates as the networks evolve. Social media platforms like Twitter and Facebook include not only actual human users but also bots, or automated programs, that can significantly alter how certain messages are spread. While some information-gathering bots are beneficial or at least benign, it was made clear by the 2016 U.S. Presidential election and the 2017 elections in France that bots and sock puppet accounts (that is, numerous social accounts controlled by a single person) were effective in influencing political messaging and propagating misinformation on Twitter and Facebook. It is thus crucial to identify and classify social bots to combat the spread of misinformation and especially the propaganda of enemy states and violent extremist groups. This article is a brief summary of my recent bot detection research. It describes the techniques I applied and the results of identifying battling groups of viral bots and cyborgs that seek to sway opinions online.
For this research, I have applied techniques from complexity theory, especially information entropy, as well as network graph analysis and community detection algorithms to identify clusters of viral bots and cyborgs (human users who use software to automate and amplify their social posts) that differ from typical human users on Twitter and Facebook. I briefly explain these approaches below, so deep prior knowledge of these areas is not necessary. In addition to commercial bots focused on promoting click traffic, I discovered competing armies of pro-Trump and anti-Trump political bots and cyborgs. During August 2017, I found that anti-Trump bots were more successful than pro-Trump bots in spreading their messages. In contrast, during the NFL protest debates in September 2017, anti-NFL (and pro-Trump) bots and cyborgs achieved greater successes and virality than pro-NFL bots.
The data sets for my Twitter bot detection research consisted of ~60M tweets that mentioned the terms “Trump,” “Russia,” “FBI,” or “Comey”; the tweets were collected via the free Twitter public API in separate periods between May 2017 and September 2017. I have made the source tweet IDs as well as many of my analysis results files available in a data project published at data.world. Researchers who wish to collaborate on this project at data.world should send a request email to firstname.lastname@example.org.
Information entropy is defined as “the average amount of information produced by a probabilistic stochastic source of data.” As such, it is one effective way to quantify the amount of randomness within a data set. Because one can reasonably conjecture that actual humans are more complicated than automated programs, entropy can be a useful signal when one is attempting to identify bots, as has been done by a number of previous researchers. Of the recent research in social bot detection, particularly notable is the excellent work by groups of researchers from the University of California and Indiana University. Their “botornot” system uses a random forest machine learning model that incorporates 1,150 features derived from user account metadata, friend/follower data, network characteristics, temporal features, content and language features, and sentiment analysis.
For this work, I elected to adopt a greatly simplified approach for social bot detection using two types of information entropy scores: one based on the distribution of time lags between successive posts and a second based on the ordering of words within the posts. Accounts that post at uniform time intervals, or whose messages have unusually static or similar text content, might be bots or cyborgs.
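The two entropy signals can be illustrated with a minimal sketch. The article does not give the exact formulas, so the choices here are assumptions: timestamps are in seconds, lags are binned to whole seconds, and "ordering of words" is approximated by the distribution of adjacent word pairs. The function names are hypothetical, and the scores reported in this article may be normalized or oriented differently.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def timing_entropy(timestamps, bin_seconds=1):
    """Entropy of the lags between successive posts.

    Assumption: timestamps are numbers of seconds, and lags are
    grouped into bins of `bin_seconds` before counting.
    """
    ordered = sorted(timestamps)
    lags = [int((b - a) // bin_seconds) for a, b in zip(ordered, ordered[1:])]
    return shannon_entropy(lags)


def text_entropy(posts):
    """Entropy of adjacent word pairs pooled over one user's posts.

    Assumption: word bigrams stand in for "the ordering of words".
    """
    bigrams = []
    for post in posts:
        words = post.lower().split()
        bigrams.extend(zip(words, words[1:]))
    return shannon_entropy(bigrams)
```

Under these assumptions, an account that posts every two seconds yields a timing entropy of zero, because every lag falls into the same bin.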
Next, I calculated the Z-scores of both the timing entropy and text entropy. In the results presented here, I set a minimum threshold of 10 social posts by a user in order to analyze said user’s posts, and then I applied a conservative threshold of 2.5 for the Z-score (that is, raw scores at or above 2.5 standard deviations above the mean) for either entropy metric in order to flag possible bots. By lowering the threshold I would, of course, detect more bots, but at the risk of false positives that might inadvertently flag actual human users as bots. In the future, I hope to calculate the ROC curve for my dual-entropy approach to characterize the tradeoffs between false positives and false negatives.
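The thresholding step described above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used for the results: the per-user scores are taken to be oriented so that larger values are more bot-like, the mean and standard deviation are computed only over users who meet the minimum post count, and the population standard deviation is used. The function and parameter names are hypothetical.

```python
import statistics


def z_scores(scores):
    """Map each user to the Z-score of that user's raw score."""
    if not scores:
        return {}
    mean = statistics.fmean(scores.values())
    stdev = statistics.pstdev(scores.values())
    if stdev == 0:
        # No spread at all: nobody stands out.
        return {user: 0.0 for user in scores}
    return {user: (value - mean) / stdev for user, value in scores.items()}


def flag_possible_bots(timing, text, post_counts, min_posts=10, z_threshold=2.5):
    """Return users whose timing OR text Z-score reaches the threshold.

    `timing` and `text` map user -> raw score (assumed: higher is more
    bot-like); `post_counts` maps user -> number of collected posts.
    """
    users = [u for u in timing
             if u in text and post_counts.get(u, 0) >= min_posts]
    timing_z = z_scores({u: timing[u] for u in users})
    text_z = z_scores({u: text[u] for u in users})
    return sorted(u for u in users
                  if timing_z[u] >= z_threshold or text_z[u] >= z_threshold)
```

Lowering `z_threshold` in this sketch flags more accounts, which mirrors the false-positive tradeoff noted above.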
The k-core of a graph is a maximal subgraph in which each vertex has at least degree k. The coreness of a vertex is k if it belongs to the k-core but not to the (k+1)-core. The k-core decomposition is performed by recursively removing all the vertices (along with their respective edges) that have degrees less than k. Previous research has suggested that the k-core decomposition of a network can be very effective in identifying the individuals within a network who are best positioned to spread or share information. I used the k-core decomposition in 2016 to analyze more than 120M tweets related to the 2016 U.S. Presidential elections to identify the most influential users. For this bot detection research, I performed a k-core decomposition of the heterogeneous user/hashtag/URL Twitter networks for each day on which I collected samples between May and September 2017.
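The peeling procedure described above can be sketched in plain Python. This is a simple, unoptimized illustration that assumes an undirected network supplied as pairs of node labels; how users, hashtags, and URLs are labeled as nodes is left to the caller, and the function name is hypothetical.

```python
def core_numbers(edges):
    """Return {node: coreness} for an undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    coreness = {}
    k = 0
    while remaining:
        # Nodes that cannot belong to the (k+1)-core of what is left.
        peel = [node for node in remaining if degree[node] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue  # already removed via an earlier duplicate entry
            remaining.discard(node)
            coreness[node] = k
            # Removing a node lowers its neighbors' degrees, which may
            # push them below the current level as well.
            for neighbor in adjacency[node]:
                if neighbor in remaining:
                    degree[neighbor] -= 1
                    if degree[neighbor] <= k:
                        peel.append(neighbor)
    return coreness
```

For a triangle with one extra node attached to a single corner, the three triangle nodes receive coreness 2 and the attached node receives coreness 1.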
By combining the entropy scores with the corresponding coreness values, I was able to identify which bots or cyborgs (that is, humans who use specialized software to automate their social media posts) were most successful in prompting other users (some of whom were also bots) to share or react to their posts, thus attaining positions closer to the center of the daily Twitter networks. (This k-core decomposition approach was used in a similar fashion by Bessi and Ferrara to measure the embeddedness of social bots.)
The 3-D scatter plot in Figure 1 shows clearly that the vast majority of the identified social bots are unsuccessful, remaining at the outer realms of the networks with low coreness values because no or few other users interact with them. Successful bots achieved higher coreness values because other users retweeted or replied to their posts. Normal human users (not shown) would be near the origin, and it is easy to discern that the higher the Z-score of either entropy metric, the less successful the bots become. This is most likely due to the fact that human users are readily able to recognize the bots’ abnormal postings and thus do not tend to share those bots’ posts. In summary, the more human-like the bot’s behavior, the more likely it is that actual users will share that bot’s posts.
The most extreme value of the text entropy Z-score (outside the plot boundaries) is 143 (with a raw text entropy of 1.0) for the Twitter user @says_k_to_trump. A few sample tweets are shown below. Note that every tweet is the single letter “k” sent in reply to each of @realDonaldTrump’s tweets. That entropy Z-score reflects the fact that this user’s tweets’ contents are completely deterministic with no uncertainty. Understandably, no other user has interacted with @says_k_to_trump, so that bot has remained at the outermost edge of the network with a coreness of 1.
The most extreme value of the timing entropy Z-score is 122.7 for the Twitter user @trade_debate. Note the very uniform timing pattern of that user’s tweets in Table 1. Starting with the second tweet, that user tweeted at a constant interval of two seconds.
In contrast, one of the most successful bots is @Bhola021, which achieved a coreness value of 96 on 2017-08-12. Several sample tweets are shown below in Table 2. This is primarily a digital marketing bot rather than a political or propaganda bot. Note, in particular, the behavior of retweeting other user accounts with similar names and very similar tweet text.
With the approach described above, one can identify potential bots and measure their degree of success, or embeddedness, within the evolving social networks. As we will see next, these results can be enhanced significantly with community detection algorithms.
To understand more clearly how the most successful viral bots and cyborgs function within the Twitter network, I created a sub-network based on the tweets sent by those bots, extracting user mentions and URLs from replies and retweets. In this example, I generated a network using the 16,057 tweets sent by the top 20 bot accounts from August 7–19, 2017. The generated network consists of 73,569 links among 2,949 nodes. A k-core decomposition of this network resulted in a maximum coreness of 20. I then applied the Louvain community detection algorithm to identify the relevant groups within the center of the network for all nodes with coreness ≥ 10. In the Polinode interactive network displayed in Figure 3, each color represents a different community within the network. Among the top 20 bots, there is a highly interconnected network of bots with similar names (porantext, porantexts_, lovedemand101, lovecommand102, etc.) that retweet and share each other’s posts. These botnets are evidently commercial bots that attempt to drive click traffic to webpages with provocative titles such as “Donald Trump Kicked One Direction Out Of His Hotel And Here’s Why” and “We Will Ruthlessly Ravage US troops, North Korea Warns Donald Trump On The Sun’s Day” as the top two article titles.
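The first step above, turning tweets into a sub-network of users and URLs, might look like the following sketch. The regular expressions, the `user:` and `url:` node prefixes, and the function names are all assumptions made for illustration; the real pipeline may instead have read mentions and URLs from the structured fields returned by the Twitter API.

```python
import re
from collections import Counter

# Assumed patterns: Twitter handles are up to 15 word characters, and a
# URL is anything that starts with http:// or https:// up to whitespace.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")
URL_RE = re.compile(r"https?://\S+")


def tweet_edges(author, text):
    """Return (source, target) links from one tweet to mentions and URLs."""
    source = f"user:{author.lower()}"
    urls = URL_RE.findall(text)
    # Strip URLs first so an '@' inside a link is not read as a mention.
    text_without_urls = URL_RE.sub(" ", text)
    edges = [(source, f"user:{name.lower()}")
             for name in MENTION_RE.findall(text_without_urls)]
    edges += [(source, f"url:{url.rstrip('.,)')}") for url in urls]
    return edges


def build_weighted_edges(tweets):
    """Count how often each link occurs across (author, text) pairs."""
    weights = Counter()
    for author, text in tweets:
        weights.update(tweet_edges(author, text))
    return weights
```

The resulting edge list could then be restricted to nodes with coreness of at least 10 and passed to any Louvain implementation, for example `louvain_communities` in recent NetworkX releases.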
Because I am particularly interested in effects of social bots in spreading information and swaying public opinions in politics, I filtered the source tweets to include only those that include the word “Russia” in the tweet text. When I performed the k-core decomposition and entropy calculations on the Russia-related Twitter network, a different set of influential bots and cyborgs emerged for the period of August 7–19, 2017.
The Polinode network shown below in Figure 4 displays 17 different sub-groups in the network created by the top 20 Russia-related bots and cyborgs.
Community 1 is a pro-Trump group centered around the bot account named MyPlace4U (see Figure 5).
In contrast, Community 10 is an anti-Trump group centered around the Twitter account named RealMuckmaker (see Figure 6), which was actually the most successful cyborg in this data set.
Table 3 below lists the top 20 viral bots and cyborgs in the Trump/Russia Twitter network for August 7–19, 2017. Note that only six of the top 20 viral bots and cyborgs act to support Donald Trump. Trump-supporting users are highlighted in red. I chose each user’s sample tweet text by calculating the mean text similarity of each tweet to the rest of that user’s tweets and selecting the tweet with the highest mean similarity using the Levenshtein distance and the fuzzywuzzy Python module.
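The selection of a representative tweet can be sketched without third-party packages. Note the swap: instead of the fuzzywuzzy module, this illustration uses a hand-written Levenshtein distance and a simple normalized similarity (one minus distance divided by the longer length), so its scores will not match fuzzywuzzy's ratios exactly. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def representative_tweet(tweets):
    """Return the tweet with the highest mean similarity to the others."""
    if not tweets:
        return None
    if len(tweets) == 1:
        return tweets[0]

    def mean_similarity(i):
        others = [similarity(tweets[i], t)
                  for j, t in enumerate(tweets) if j != i]
        return sum(others) / len(others)

    return tweets[max(range(len(tweets)), key=mean_similarity)]
```

The pairwise comparison is quadratic in the number of tweets, so for prolific accounts a random sample of tweets may be a practical compromise.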
To discern how successful the different groups of Russia-related bots and cyborgs were in spreading their messages on Twitter, I calculated the daily mean and maximum coreness values attained by the six pro-Trump users in Table 3 versus the remaining 14 anti-Trump (or neutral) users in Table 3. Figure 7 shows that, overall, the anti-Trump group was more successful in spreading its messages during the period of August 7–19, 2017, with the greatest peak on August 11 led by @RealMuckmaker, which promoted a link to a particular CNN Politics article regarding the FBI’s raid on the home of former Trump campaign manager Paul Manafort.
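The daily group comparison amounts to a small aggregation, sketched below. The record layout, the group labels, and the function name are assumptions for illustration; users without a group label are simply skipped.

```python
from collections import defaultdict
from statistics import fmean


def daily_group_coreness(records, groups):
    """Summarize coreness per (day, group).

    `records` is an iterable of (day, user, coreness) tuples and `groups`
    maps user -> group label (for example "pro" or "anti").
    """
    values = defaultdict(list)
    for day, user, coreness in records:
        if user in groups:
            values[(day, groups[user])].append(coreness)
    return {key: {"mean": fmean(vals), "max": max(vals)}
            for key, vals in values.items()}
```

Plotting the `max` values per day for each group would give a comparison of the kind shown in Figure 7.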
I applied the same entropy-based bot detection and network analysis approach to over 1M tweets that included the terms “Trump” and “NFL” from September 14–25, 2017. The Polinode network shown below in Figure 8 displays 16 different sub-groups in the network created by the top 20 NFL-related bots and cyborgs. Nine of the groups are opposed to the NFL protests while seven are in favor of the NFL players who took a knee in protest.
As in the Russia-related example, I calculated the maximum daily coreness value for the pro-NFL and anti-NFL groups within the top 20 viral NFL-related bots. Figure 9 shows that the anti-NFL (and pro-Trump) bots and cyborgs were more successful in spreading their social content than the pro-NFL group. Refer to my data.world data project for further details.
Given the increasing number of reports of Russian involvement in last year’s elections across multiple social platforms, I wanted to apply the entropy-based bot detection method to election-related Facebook data. Our friend and research colleague Jonathon Morgan, the CEO of New Knowledge and co-founder of Data for Democracy, kindly provided a data set of 10.5M public Facebook comments from Donald Trump’s Facebook page collected between July 2016 and April 2017.
Unfortunately, because I have only the text content and timestamps of the users’ Facebook comments, I do not have the full social network structure available as I did in the previous Twitter examples. Consequently, it is not possible to perform the same type of k-core decomposition. I found that the number of “likes” is not a particularly strong or reliable predictor of the degree of success for a bot or cyborg. The 20 Facebook users with the most extreme Z-scores of text entropy are listed in Table 4 below. The top user, Nadya Noor, had a text entropy score more than 253 standard deviations above the mean score for the rest of the users.
Figure 10 shows a Google translation of one of that user’s typical, strongly anti-American comments.
In the future, I plan to apply community detection algorithms to the text content and embedded URLs in these Facebook bots’ posts to determine their primary discussion topics and political leanings.
In this article, I have demonstrated how it is readily possible to identify social bots and cyborgs on both Twitter and Facebook using information entropy and then to find groups of successful bots using network analysis and community detection. Given the extreme risks of disinformation and propaganda being spread through social media, it is our hope that this approach, along with the work of other researchers, will enable greater transparency and help protect democracy and the authenticity of online discourse. I invite researchers who wish to collaborate on studies of these data sets to request access to become collaborators on our data project hosted on data.world.