Authors:
(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;
(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), Ecole nationale des chartes, Universite Paris, Sciences & Lettres.
Why work on QAnon? Specificities and social impact
Who is Q? The theories put to test
Quotes of authors outside of the corpus have been
Definition of two subcorpora: dealing with generic difference and an imbalanced dataset
The genre of “Q drops”: a methodological challenge
Detecting style changes: rolling stylometry
Ethical statement, Acknowledgements, and References
The difficulties in collecting data for a variety of individuals and profiles, as well as the ubiquity of deleted content that we cannot recover, force us to adopt a dual approach and to build two corpora:
a large corpus, in which we include the largest number of candidates, whatever the number of samples available to us and the genre of said samples. It contains everything described in the Corpus constitution section above;
a controlled corpus, from which we removed authors for whom only cross-genre samples and/or too few samples are available, and in which we do not include training material that is too different from the rest (in particular, books). It is the same as the previous one, minus:
• interview transcripts (Michael F., Paul F.);
• a book by Paul F.;
• the small amount of available data for Courtney T. and Roger S.
In both cases, due to the limitations in data collection and available material, the quantity of training material is imbalanced between authors, a potential problem in machine learning. To counter this effect, we used class weights during training: errors for a given class are not uniformly penalised by one, but by a specific weight inversely proportional to class size, where the weight for class i is computed as

$$w_i = \frac{N}{C \cdot n_i}$$

where N is the total number of samples, C the total number of unique classes, and n_i the number of samples in class i. We used the sklearn ‘balanced’ implementation (Pedregosa et al., 2011).
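As an illustration of this weighting, the following minimal sketch computes a weight for each class as N / (C × n_i), with N, C and n_i as defined above. This is the formula behind scikit-learn's class_weight="balanced" option. The author labels and sample counts are hypothetical toy values, not figures from our corpora, and the helper name balanced_class_weights is introduced here for illustration only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: N / (C * n_i)} for a sequence of class labels.

    N   = total number of samples
    C   = number of distinct classes
    n_i = number of samples in class i
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n_i) for cls, n_i in counts.items()}


# Hypothetical, deliberately imbalanced toy data (not taken from the paper).
labels = ["author_a"] * 6 + ["author_b"] * 3 + ["author_c"] * 1
weights = balanced_class_weights(labels)

# Classes with fewer samples receive larger weights.
for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

With this scheme, each class contributes the same total weight (N / C) to the training loss, however many samples it has, so errors on sparsely documented authors are not drowned out by errors on well-documented ones.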
This paper is available on arXiv under CC BY 4.0 DEED license.