Machine Learning and Linguistic Profiles Sheds Light on Q's Possible Authors

by Ethnology, December 7th, 2024

Too Long; Didn't Read

Researchers analysed 13 candidate authors of the Q drops. For most slices of the text, the highest decision function is by far that of Ron W. (Fig. 1). The most significant deviation concerns the first period of the Q drops, before the switch to 8chan, for which the larger corpus analysis gives Paul F. as the top candidate.

Authors:

(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;

(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), École nationale des chartes, Université Paris Sciences & Lettres.

Abstract and Introduction

Why work on QAnon? Specificities and social impact

Who is Q? The theories put to test

Authorship attribution

Results

Discussion

Corpus constitution

Quotes of authors outside of the corpus have been

Definition of two subcorpus: dealing with generic difference and an imbalanced dataset

The genre of “Q drops”: a methodological challenge

Detecting style changes: rolling stylometry

Ethical statement, Acknowledgements, and References

Results

Profiles capturing unconscious features of style, such as grammatical morphemes, were built with supervised machine learning from two corpora of texts signed by each putative author: a large corpus covering all 13 candidates, though with little relevant material for some of them, and a smaller, more controlled corpus. The general attributive performance exceeds 97% (Materials and methods). The profiles show that, for most of the slices, the highest decision function is by far that of Ron W. (Fig. 1). The most significant deviation concerns the first period of the Q drops, before the switch to 8chan. In this period, the larger corpus analysis gives Paul F. as, by far, the top candidate. A period follows in which the Paul F. and Ron W. signals compete, until the Ron W. signal finally takes over after a second break. That break closely matches a tweet described by Paul F. himself as the last authentic Q drop, which reads:


There will be no further posts on this board under this ID.

This will verify the trip is safeguarded and in our control.

This will verify this board is compromised.

God bless each and every one of you.

Fight, fight, fight!

Q


The dominance of Paul F. in the first period is not seen at all in the smaller corpus analysis.


Secondarily, there are very localised spikes in the Christina U. and Michael F. signals, especially in the more recent period of the Q drops. The rest of the candidates lag far behind.


The results of the two rolling analyses, and any differences between them, have to be contextualised by investigating the features that received the strongest coefficients in the different SVM classifiers (Fig. 2). For some candidates, like Ron W., the features seem mostly idiolectal, like the 3-grams ‘nyb’ and ‘ybo’ (in ‘anybody’) or the relative avoidance of ‘ th’ and ‘his’, and they remain stable across both analyses. This is also the case, for instance, for Donald T., whose most distinctive feature is ‘fak’, part of his very idiolectal ‘FAKE’, while others are more content-related (‘mpg’ is even due to the regularity with which he mentioned ‘BrianKempGA’ in the training material), a consequence of the choice of character 3-grams as features.
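To make the feature type concrete, here is a minimal sketch of character 3-gram extraction. It is an illustration only, not the authors' pipeline: the function names, the lowercasing, and the whitespace normalisation are assumptions, chosen because the reported features ('fak' from 'FAKE', ' th' with a leading space) suggest case-insensitive 3-grams that can span word boundaries. The sample sentence is invented for the example.

```python
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count overlapping character 3-grams (hypothetical helper).

    Assumption: text is lowercased and runs of whitespace are collapsed
    to a single space, so a 3-gram such as ' th' can span a word boundary.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def relative_frequencies(counts: Counter) -> dict:
    """Turn raw 3-gram counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in counts.items()}


# Invented sample sentence, not taken from the corpus.
grams = char_trigrams("Anybody can see this is FAKE, says BrianKempGA")
freqs = relative_frequencies(grams)

# 'anybody' yields 'nyb' and 'ybo'; 'FAKE' yields 'fak';
# 'BrianKempGA' yields 'mpg', which is how a handle becomes a feature.
for gram in ("nyb", "ybo", "fak", "mpg", " th"):
    print(repr(gram), grams[gram])
```

Because a 3-gram carries no notion of word or meaning, the same representation picks up both grammatical habits and topical vocabulary, which is why the strongest coefficients need to be inspected candidate by candidate.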


For authors like Christina U., the features are strongly content- and news-related, like the 3-grams extracted from ‘Israel(i)’, ‘blm’, ‘psy’ (psychologists, psychiatrists, . . . ), etc.


In the case of Michael F., the features seem very dependent on the small quantity of available training material and on the grandiloquent and religious nature of the little material there is, with features such as ‘god’ (‘God’), ‘hty’ (‘almighty’), ‘lib’ (‘liberty’).




Figure 1: Decision function of each classifier for each successive overlapping window of Q drops (windows of 1000 words, with steps of 200 words), arranged in chronological order, for the large corpus (top) and the controlled corpus (bottom).
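The windowing named in the Figure 1 caption (windows of 1000 words, steps of 200 words) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: the function name is hypothetical, the text is assumed to be already tokenised into words in chronological order, and trailing words that do not fill a complete window are simply dropped, since the source does not say how the tail is handled.

```python
def rolling_windows(words, size=1000, step=200):
    """Yield (start_index, window) pairs for overlapping word windows.

    Assumptions: `words` is a chronologically ordered list of tokens;
    trailing words that do not fill a complete window are dropped.
    A text shorter than `size` yields a single, shorter window.
    """
    last_start = max(len(words) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, words[start:start + size]


# Placeholder tokens standing in for the concatenated Q drops.
words = [f"w{i}" for i in range(1600)]
windows = list(rolling_windows(words))

for start, window in windows:
    print(start, len(window))
```

Each window would then be turned into a feature vector and scored by every candidate's classifier, giving one decision-function value per candidate per window. With a step of 200 and a size of 1000, consecutive windows share 800 words, so neighbouring scores are strongly correlated and abrupt changes in the curves are the more informative signal.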


Finally, and more importantly, the way these features vary between the two analyses gives very good insight into the different results concerning Paul F. In the small corpus, due to the exclusion of his book, the most distinctive features for him are all curse words and racist insults (‘ fu’, ‘fuc’, ‘uck’, ‘shi’, ‘hit’, ‘ ni’, ‘nig’, ‘igg’, ‘gge’, etc.). In the larger corpus, on the other hand, with the book included, they seem to reveal more neutral idiolectal (and grammatical) features, with pronouns, auxiliaries and determiners (‘he ’, ‘had’, ‘was’, ‘the’, etc.). These elements suggest that, as far as Paul F. is concerned, the larger corpus analysis is more reliable than the smaller corpus analysis, especially in a cross-genre setup.


This paper is available on arXiv under a CC BY 4.0 DEED license.