We all remember Tay, the chatbot Microsoft released on Twitter that quickly turned into a racist, misogynistic, deviant neo-Nazi. This is what happens when ill-intentioned individuals take advantage of an “innocent” machine learning algorithm. Before unleashing Tay on the open world, Microsoft trained its algorithm to have the personality of a fun and loving 16-year-old. I recall finding it quite fascinating (and funny) to see how quickly the internet could turn a teen-talking chatbot into an AI nightmare.
Which made me think: what if Microsoft had kept Tay online? What if we did it intentionally, training an algorithm to be the worst it can be…
Hence this project, in which I intend to train a Markov chain on datasets generated using the 4chan API:
APIs, Dataset Generation
4chan is a very special place; it is a social entity unlike any other. With the cover of anonymity and the absence of moderation, you can say almost anything you want without fear of retribution, resulting in blatant racism, homophobia, gore, etc.
Given our goal of creating the worst AI chatbot, 4chan is the perfect candidate to generate our datasets from.
Like Reddit, 4chan is divided into various boards, each with its own specific content and community of users. As a result, each dataset will be generated using only one board as input:
In 2012, the 4chan team released a set of APIs to facilitate the work of developers scraping the site. As shown above, I use them to retrieve all the posts from the first 5 pages of the corresponding board (passed as a parameter). After cleaning the data (the parser function), I finish by writing the results into a file (e.g. ./data/fit.txt).
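For readers who want to follow along, a minimal sketch of this fetch-clean-write step could look like the following in Go. It assumes 4chan's public read-only index endpoint (https://a.4cdn.org/{board}/{page}.json, where posts carry their body as HTML in a "com" field); the parser helper here is my own illustration of the cleaning step, not the project's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"regexp"
	"strings"
)

// page mirrors the shape of a 4chan index page: a list of threads,
// each holding a list of posts whose body ("com") is HTML.
type page struct {
	Threads []struct {
		Posts []struct {
			Com string `json:"com"`
		} `json:"posts"`
	} `json:"threads"`
}

var tagRe = regexp.MustCompile(`<[^>]*>`)

// parser strips the HTML markup 4chan embeds in post bodies
// (quote links, <br> line breaks, common entities) down to plain text.
func parser(com string) string {
	s := strings.ReplaceAll(com, "<br>", " ")
	s = tagRe.ReplaceAllString(s, "")
	s = strings.ReplaceAll(s, "&gt;", ">")
	s = strings.ReplaceAll(s, "&#039;", "'")
	s = strings.ReplaceAll(s, "&quot;", `"`)
	s = strings.ReplaceAll(s, "&amp;", "&")
	return strings.TrimSpace(s)
}

// fetchBoard downloads the first n index pages of a board and
// writes the cleaned post bodies to ./data/<board>.txt.
func fetchBoard(board string, n int) error {
	f, err := os.Create(fmt.Sprintf("./data/%s.txt", board))
	if err != nil {
		return err
	}
	defer f.Close()

	for i := 1; i <= n; i++ {
		resp, err := http.Get(fmt.Sprintf("https://a.4cdn.org/%s/%d.json", board, i))
		if err != nil {
			return err
		}
		var p page
		err = json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if err != nil {
			return err
		}
		for _, t := range p.Threads {
			for _, post := range t.Posts {
				if post.Com != "" {
					fmt.Fprintln(f, parser(post.Com))
				}
			}
		}
	}
	return nil
}
```

Calling fetchBoard("fit", 5) would then produce the ./data/fit.txt dataset mentioned above.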
Markov Chain Text Generator
For a given system or process, a Markov chain is essentially a sequence of states, with a probability associated with each transition between two states.
In the context of text generation, a Markov chain helps you determine the most probable suffix word for a given prefix. To produce good results, it is important to feed the algorithm a reasonably large training set. By fetching all the posts from the first 5 pages of a given board, we get around 50,000 words per dataset.
Our Markov chain constructor (NewMarkovFromFile) takes the prefix length (n) as a parameter; we use 2 as the default value, which is the recommended setting to prevent the output text from seeming either too random or too close to the original text.
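To make the prefix/suffix mechanics concrete, here is a minimal sketch of such a chain in Go. The names Chain, Build, and Generate are illustrative placeholders, not the project's actual NewMarkovFromFile implementation:

```go
package main

import (
	"math/rand"
	"strings"
)

// Chain maps an n-word prefix to the list of suffix words
// observed immediately after it in the training text.
type Chain struct {
	suffixes map[string][]string
	n        int
}

// NewChain creates a chain with the given prefix length n.
func NewChain(n int) *Chain {
	return &Chain{suffixes: make(map[string][]string), n: n}
}

// Build records every prefix -> suffix transition in the training text.
func (c *Chain) Build(text string) {
	prefix := make([]string, c.n)
	for _, w := range strings.Fields(text) {
		key := strings.Join(prefix, " ")
		c.suffixes[key] = append(c.suffixes[key], w)
		// Slide the prefix window forward by one word.
		copy(prefix, prefix[1:])
		prefix[c.n-1] = w
	}
}

// Generate walks the chain from the empty prefix, picking a random
// recorded suffix at each step, until maxWords is reached or the
// current prefix has no known suffix.
func (c *Chain) Generate(maxWords int) string {
	prefix := make([]string, c.n)
	var out []string
	for i := 0; i < maxWords; i++ {
		choices := c.suffixes[strings.Join(prefix, " ")]
		if len(choices) == 0 {
			break
		}
		w := choices[rand.Intn(len(choices))]
		out = append(out, w)
		copy(prefix, prefix[1:])
		prefix[c.n-1] = w
	}
	return strings.Join(out, " ")
}
```

With n = 2, each generated word depends on the two words before it, which is exactly the trade-off described above: a longer prefix copies the source more faithfully, a shorter one drifts toward nonsense.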
Considering the nature of the content, this is not safe for work:
What’s next?
This is just the first part of the project; a lot more can be done: