Authors:
(1) Florian Cafiero (ORCID 0000-0002-1951-6942), Sciences Po, Medialab;
(2) Jean-Baptiste Camps (ORCID 0000-0003-0385-7037), Ecole nationale des chartes, Universite Paris, Sciences & Lettres.

Table of Links
• Abstract and Introduction
• Why work on QAnon? Specificities and social impact
• Who is Q? The theories put to test
• Authorship attribution
• Results
• Discussion
• Corpus constitution
• Quotes of authors outside of the corpus have been
• Definition of two subcorpus: dealing with generic difference and an imbalanced dataset
• The genre of “Q drops”: a methodological challenge
• Detecting style changes: rolling stylometry
• Ethical statement, Acknowledgements, and References

Definition of two subcorpus: dealing with generic difference and an imbalanced dataset

The difficulties in collecting data for a variety of individuals and profiles, together with the ubiquity of deleted content that we could not recover, force us to adopt a dual approach and to build two corpora:

• a large corpus, in which we include the largest possible number of candidates, whatever the number of samples available to us and the genre of said samples. It contains everything described in the Corpus constitution section above;
• a controlled corpus, from which we removed authors for whom only cross-genre samples and/or too few samples are available, and which does not include training material that is too different from the rest (in particular, books). It is the same as the previous one, minus:
  • interview transcripts (Michael F., Paul F.);
  • a book by Paul F.;
  • the small amount of available data for Courtney T. and Roger S.

In both cases, due to the limitations in data collection and available material, the quantity of training material is imbalanced between authors, a potential problem in machine learning.
To counter this effect, we used class weights during training: errors for a given class are not always penalised by one, but by a specific weight inversely proportional to class size. The weight for class i is computed as

$$ w_i = \frac{N}{C \, n_i} $$

where N is the total number of samples, C the total number of unique classes, and n_i the number of samples in class i. We used the sklearn ‘balanced’ implementation (Pedregosa et al., 2011).
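As an illustration of the weighting described above, here is a minimal sketch that computes N / (C * n_i) for each class in plain Python. It is not the authors' code: the function name `balanced_class_weights` and the toy author labels are hypothetical, chosen only for the example. The formula is the one scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: N / (C * n_i)} for a sequence of class labels.

    N   = total number of samples
    C   = number of distinct classes
    n_i = number of samples in class i
    """
    counts = Counter(labels)      # n_i for every class i
    n_samples = len(labels)       # N
    n_classes = len(counts)       # C
    return {cls: n_samples / (n_classes * n_i) for cls, n_i in counts.items()}


# Hypothetical toy data: three candidate authors with 6, 3 and 1 samples.
toy_labels = ["author_a"] * 6 + ["author_b"] * 3 + ["author_c"] * 1
weights = balanced_class_weights(toy_labels)

for author, weight in sorted(weights.items()):
    print(f"{author}: {weight:.3f}")
```

With these toy counts, the rarest class receives the largest weight and the most frequent class the smallest, so a misclassified sample from a sparsely documented author costs more during training. Weighted by class size, the weights sum back to N, which keeps the overall scale of the loss unchanged.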
This paper is available on arXiv under the CC BY 4.0 DEED license.
