QAnon Text Analysis shows Imbalanced Datasets and Generic differences

by Ethnology Technology, December 7th, 2024

Too Long; Didn't Read

Difficulties in collecting data for a variety of individuals and profiles, together with the ubiquity of deleted content that we cannot recover, force us to adopt a dual approach. We build two corpora: a large corpus, in which we include the largest number of candidates, whatever the number of samples available to us and the genre of said samples; and a controlled corpus, from which we removed authors for whom only cross-genre samples and/or too few samples are available, and which excludes training material that is too different from the rest.

Authors:

(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;

(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), École nationale des chartes, Université Paris Sciences & Lettres.

Abstract and Introduction

Why work on QAnon? Specificities and social impact

Who is Q? The theories put to test

Authorship attribution

Results

Discussion

Corpus constitution

Quotes of authors outside of the corpus have been

Definition of two subcorpora: dealing with generic difference and an imbalanced dataset

The genre of “Q drops”: a methodological challenge

Detecting style changes: rolling stylometry

Ethical statement, Acknowledgements, and References

Definition of two subcorpora: dealing with generic difference and an imbalanced dataset

Difficulties in collecting data for a variety of individuals and profiles, together with the ubiquity of deleted content that we cannot recover, force us to adopt a dual approach and to build two corpora:


a large corpus, in which we include the largest number of candidates, whatever the number of samples available to us and the genre of said samples. It contains everything described in the Corpus constitution section above;


a controlled corpus, from which we removed authors for whom only cross-genre samples and/or too few samples are available, and which does not include training material that is too different from the rest (in particular, books). It is the same as the previous one, minus:


• interview transcripts (Michael F., Paul F.);


• a book by Paul F.;


• the small amount of available data for Courtney T. and Roger S.


In both cases, due to the limitations in data collection and available material, the quantity of training material is imbalanced between authors, a potential problem in machine learning. To counter this effect, we used class weights during training: errors for a given class are not always penalised by one, but by a specific weight inversely proportional to class size. The weight for class i is computed as

$$w_i = \frac{N}{C \, n_i}$$

where N is the total number of samples, C the total number of unique classes, and n_i the number of samples in class i. We used the sklearn ‘balanced’ implementation (Pedregosa et al., 2011).
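As an illustration of this weighting, here is a minimal sketch that computes a weight of N / (C × n_i) for each class using only the Python standard library. This is the same heuristic scikit-learn applies when `class_weight="balanced"` is set. The function name `balanced_class_weights` and the toy author labels are hypothetical, chosen for illustration only; they are not the authors' code or data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: N / (C * n_i)} for an iterable of class labels.

    N is the total number of samples, C the number of distinct classes,
    and n_i the number of samples in class i.
    """
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels: 6 samples for author "A", 3 for "B", 1 for "C".
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = balanced_class_weights(labels)

for author, weight in sorted(weights.items()):
    print(f"{author}: {weight:.3f}")
# A: 0.556
# B: 1.111
# C: 3.333
```

In this toy example, an error on the rarest author ("C") is penalised six times more heavily than an error on the most frequent one ("A"), while the weights summed over all samples still equal N, so the overall scale of the loss is unchanged.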


This paper is available on arXiv under a CC BY 4.0 DEED license.