Machine Learning and Linguistic Profiles Sheds Light on Q's Possible Authors

Authors:

(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;

(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), Ecole nationale des chartes, Universite Paris, Sciences & Lettres.

Table of Links

Abstract and Introduction

Why work on QAnon? Specificities and social impact

Who is Q? The theories put to test

Authorship attribution

Results

Discussion

Corpus constitution

Quotes of authors outside of the corpus have been

Definition of two subcorpus: dealing with generic difference and an imbalanced dataset

The genre of “Q drops”: a methodological challenge

Detecting style changes: rolling stylometry

Ethical statement, Acknowledgements, and References

Results

Profiles capturing unconscious features of style, such as grammatical morphemes, were built with supervised machine learning from two corpora of texts signed by each putative author: a large corpus covering all 13 candidates, though sometimes with little relevant material for some of them, and a smaller, more controlled corpus. General attribution performance is over 97% (Materials and methods). The profiles show that, for most of the slices, the highest decision function value is by far that of Ron W. (Fig. 1). The most significant deviation from this concerns the first period of the Q drops, before the switch to 8chan. In this period, the larger corpus analysis gives Paul F. as by far the top candidate. It is followed by a period in which the Paul F. and Ron W. signals compete, until the Ron W. signal finally takes over after a second break. That break closely matches a tweet described by Paul F. himself as the last authentic Q drop, which reads:

There will be no further posts on this board under this ID.

This will verify the trip is safeguarded and in our control.

This will verify this board is compromised.

God bless each and every one of you.

Fight, fight, fight!

The dominance of Paul F. in the first period is not seen at all in the smaller corpus analysis.

Of secondary importance, there are very localised spikes in the Christina U. and Michael F. signals, especially in the more recent period of the Q drops. The rest of the candidates lag far behind.
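The rolling analysis described above can be pictured with a minimal sketch: the text is cut into overlapping slices, each slice is turned into character 3-gram frequencies, and every candidate receives a linear score for that slice. Everything named here is an assumption made for illustration: the function names, the window size and step, and the per-candidate weight dictionaries, which merely stand in for the coefficients of trained SVM classifiers. It is not the authors' implementation.

```python
from collections import Counter

def char_trigram_freqs(text):
    """Relative frequencies of character 3-grams in a lowercased text."""
    text = text.lower()
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}

def rolling_slices(words, size, step):
    """Yield (start_index, slice_text) for overlapping windows of `size` words."""
    for start in range(0, max(len(words) - size, 0) + 1, step):
        yield start, " ".join(words[start:start + size])

def decision_scores(freqs, weights_by_candidate):
    """Linear score per candidate: dot product of frequencies and weights.

    The weights are hypothetical placeholders for trained classifier
    coefficients; no intercept term is modelled here.
    """
    return {
        name: sum(w * freqs.get(g, 0.0) for g, w in weights.items())
        for name, weights in weights_by_candidate.items()
    }

def rolling_attribution(text, weights_by_candidate, size=1000, step=200):
    """Return (start, top_candidate, scores) for each slice of the text.

    `size` and `step` are arbitrary illustrative values.
    """
    words = text.split()
    results = []
    for start, chunk in rolling_slices(words, size, step):
        scores = decision_scores(char_trigram_freqs(chunk), weights_by_candidate)
        results.append((start, max(scores, key=scores.get), scores))
    return results
```

Plotting the per-candidate scores against the slice start index would give the kind of curve in which one candidate's signal dominates, competes with another, or spikes locally.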

Results obtained on the two rolling analyses, and any differences between them, have to be contextualised by investigating the features that received the strongest coefficients in the different SVM classifiers (Fig. 2). For some candidates, like Ron W., the features seem mostly idiolectal, like the 3-grams ‘nyb’ and ‘ybo’ (in ‘anybody’) or the relative avoidance of ‘ th’ and ‘his’, and they remain stable between the two analyses. This is also the case, for instance, for Donald T., whose most distinctive feature is ‘fak’, part of his very idiolectal ‘FAKE’, while others are more content-related (‘mpg’ is even due to the regularity with which he mentioned ‘BrianKempGA’ in the training material), a consequence of the choice of character 3-grams as features.
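To make the feature type concrete, the following sketch extracts overlapping character 3-grams from a string. Lowercasing is an assumption inferred from the lowercase features quoted above (‘fak’ from ‘FAKE’), and the function name is hypothetical. It shows why a single recurring word or handle is enough to produce a distinctive feature.

```python
from collections import Counter

def char_trigrams(text):
    """Return a Counter of overlapping character 3-grams, lowercased."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# The words quoted in the text, and the 3-grams each one contributes.
for word in ("anybody", "FAKE", "BrianKempGA"):
    print(word, sorted(char_trigrams(word)))

# Spaces count as characters, so word boundaries become features too.
print(" th" in char_trigrams("in the"))
```

Because 3-grams are cut from raw character sequences, they pick up function-word habits and topical vocabulary alike, which is why some of the strongest features are idiolectal and others content-related.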

For authors like Christina U., the features are heavily content- and news-related, like the 3-grams extracted from ‘Israel(i)’, ‘blm’, ‘psy’ (psychologists, psychiatrists, ...), etc.

In the case of Michael F., the features seem very dependent on the small quantity of available training material and on its grandiloquent and religious nature, with features such as ‘god’ (‘God’), ‘hty’ (‘almighty’), ‘lib’ (‘liberty’).

Finally, and more importantly, these features, in their variation between analyses, give very good insight into the differing results concerning Paul F. In the small corpus, due to the exclusion of his book, the most distinctive features for him all come from curse words and racist insults (‘ fu’, ‘fuc’, ‘uck’, ‘shi’, ‘hit’, ‘ ni’, ‘nig’, ‘igg’, ‘gge’, etc.); in the larger corpus, on the other hand, with the book included, they seem to reveal more neutral idiolectal (and grammatical) features, with pronouns, auxiliaries and determiners (‘he ’, ‘had’, ‘was’, ‘the’, etc.). These elements suggest that, where Paul F. is concerned, the larger corpus analysis is more reliable than the smaller corpus analysis, especially in a cross-genre setup.
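One way to picture this comparison of features between analyses is to rank a candidate's coefficients in each analysis and measure how much the top features overlap. The sketch below assumes the coefficients of a linear classifier are available as a feature-to-weight mapping; the helper names and all numbers are invented for illustration and are not taken from the paper.

```python
def top_features(coefs, k=3):
    """Return (preferred, avoided): the k most positive and k most negative features."""
    ranked = sorted(coefs.items(), key=lambda item: item[1], reverse=True)
    preferred = [f for f, w in ranked[:k] if w > 0]
    avoided = [f for f, w in ranked[::-1][:k] if w < 0]
    return preferred, avoided

def jaccard(a, b):
    """Overlap between two feature lists, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Invented toy coefficients for one hypothetical candidate in two analyses.
analysis_a = {"nyb": 0.9, "ybo": 0.8, "abc": 0.2, " th": -0.7, "his": -0.5}
analysis_b = {"nyb": 0.85, "ybo": 0.75, "xyz": 0.3, " th": -0.6, "his": -0.4}

pref_a, avoid_a = top_features(analysis_a, k=2)
pref_b, avoid_b = top_features(analysis_b, k=2)
print(pref_a, avoid_a)            # most preferred and most avoided 3-grams
print(jaccard(pref_a, pref_b))    # stability of the preferred features
```

A high overlap corresponds to the stable, idiolectal case; a low overlap corresponds to a profile that changes when the training material changes, as when a book is included or excluded.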

This paper is available on arXiv under the CC BY 4.0 DEED license.
