Attacking Toxic Comments Kaggle Competition Using Fast.ai

Kaggle is a good place to learn and practice your Machine Learning skills. It's also a great place to find the proper dataset for your learning projects. I needed a good classification NLP dataset to practice my recently learned fast.ai lesson, and I came across the Toxic Comment Classification Challenge. The competition was held two years ago and has long since concluded, but it doesn't hurt to submit my scores and see how well I did. This is one of the things Kaggle is great for: in the real world, it is usually much harder to know how good or bad your model is, whereas on Kaggle you can see clearly where your performance stands on the Leaderboard.

The Data Set

This competition was held by the Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet). Its goal is to find the best model for classifying multiple toxicity types in comments. The toxicity types are:

toxic
severe_toxic
obscene
threat
insult
identity_hate

Comments are given in a training file train.csv and a testing file test.csv. You need to predict a probability for each type of toxicity for each comment in test.csv. It is a multi-label NLP classification problem.

Look at the Data

Let's first take a look at the data. We need to import the necessary modules and do some logistics to set up the paths for our files.

    import numpy as np   # linear algebra
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    from fastai.text import *
    from fastai import *

Notice that we imported everything from the fastai.text and fastai modules. Does this go against software engineering best practice? Actually, not quite. It's a deliberate move that suits the more iterative and interactive way data science is done. With the whole library available, I can easily test and try different functions/modules without having to go back and import them every time.
Importing everything up front makes the explore/experiment flow much smoother. But I digress; let's load the data and look at it:

    # Kaggle stores datasets in the /kaggle/input/ folder
    path = Path('/kaggle/input/jigsaw-toxic-comment-classification-challenge/')
    path.ls()

    # the /kaggle/input/ folder is read-only, so copy the files away so I can also write to the folder
    !mkdir data
    !cp -a {path}/*.* ./data/
    !ls data

    # make sure everything is correctly copied over
    path = Path('/kaggle/working/data/')
    path.ls()

    # read in the data and have a peek
    df = pd.read_csv(path/'train.csv')
    df.head()

(The toxicity types are one-hot encoded)

The comments are in the column comment_text, and all toxicity types are 'one-hot' encoded (one 0/1 column per type). We'll have to do something about that later to make the data fit into our model.

(Have a look at one comment)

Transfer Learning: Fine-Tune Our Language Model

We'll use transfer learning for this task. To do that, we'll use a language model pre-trained on a Wikipedia-based dataset called wikitext-103. It is a model that has already been trained on the Wikipedia dataset (or 'corpus' in NLP terms) to predict the next word of a given unfinished sentence. We'll leverage the 'language knowledge' the model has already learned from Wikipedia and build on top of that. To achieve the best results, we need to 'fine-tune' the model so that it learns a bit from our 'comments' dataset, since what people say in comments is not necessarily the same as the more formal Wiki text. Once the language model is fine-tuned, we can use it for our classification task.

Now let's load the training data into a fast.ai databunch so we can start training the language model first.

    bs = 64  # set batch size to 64, works for Kaggle Kernels
    data_lm = (TextList.from_df(df, path, cols='comment_text')
                    .split_by_rand_pct(0.1)
                    .label_for_lm()
                    .databunch(bs=bs))

We use fast.ai's Data Block API for this task. It is a very flexible and powerful way to address the challenging task of building a pipeline: loading your data into the model. It breaks the entire process into separate parts/steps, each with multiple methods/functions to adapt to different types of data and different ways of storing it. The concept is a lot like the Linux philosophy: highly modularized, with each module doing only one thing but doing it really, really well. You are free to explore the wonderful API here. The code above does the following things:

1. Import data from the Pandas DataFrame named df and tell the model to use comment_text as input ( TextList.from_df(df, path, cols='comment_text') ). Note that I could also include test.csv in the language model. It's not considered 'cheating', since we are not using the labels, just doing language model training.
2. Split the training dataset at random into a train set (90 percent) and a validation set (10 percent). ( .split_by_rand_pct(0.1) )
3. Ignore the given labels (since we are only fine-tuning the language model, not training the classifier yet) and use the language model's 'predict next word' as labels. ( .label_for_lm() )
4. Build the data into a databunch with batch size bs. ( .databunch(bs=bs) )

Now let's look at the databunch we just built:

(Notice we lost all the toxicity types)

Notice that the databunch doesn't have any of the toxicity type labels, since we are only fine-tuning the language model.

OK, time for some typical fast.ai learning rate adjustments and training. We put our databunch into a language_model_learner, tell it the language model base we want to use (AWD_LSTM), and assign a default dropout rate of 0.3. From the LR Finder graph, find the biggest downward slope and pick the middle point as our learning rate. (For a more detailed explanation of how this 'fit_one_cycle' magic is done, please refer to this article. It is a state-of-the-art technique in fast.ai that combines learning rate and momentum annealing.) Now we can 'unfreeze' the model and train the entire model for a couple of epochs.

We can look at one example of how well the model did. The result is hardly optimal, but we at least get a sentence that actually makes sense, and 0.38 accuracy for predicting the next word is not bad. Ideally, we would train for a few more epochs, but in this Kaggle Kernel I was running out of GPU quota, so I stopped at 4. The result definitely has room to improve, and you can try it yourself. Anyway, what we want from the language model is the encoder part, so we save it.

Training the language model does take quite some time, but the good news is that, for your own domain corpus, you only have to train it once; later you can use it as a base for any other classification task.

    learn.save_encoder('fine_tuned_enc')  # save the encoder for use in the next step

Transfer Learning: Training the Classifier

Let's read in the test dataset:

    test = pd.read_csv(path/"test.csv")
    test_datalist = TextList.from_df(test, cols='comment_text')

Again, build our databunch:

    data_cls = (TextList.from_csv(path, 'train.csv', cols='comment_text', vocab=data_lm.vocab)
                    .split_by_rand_pct(valid_pct=0.1)
                    .label_from_df(cols=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], label_cls=MultiCategoryList, one_hot=True)
                    .add_test(test_datalist)
                    .databunch())
    data_cls.save('data_clas.pkl')

Please note the differences this time:

1. When building the TextList, we specified vocab=data_lm.vocab. This way we make sure we are using the same vocabulary, so our training on the language model can be properly applied to the classifier model.
2. We now use all our toxicity type labels. ( .label_from_df(cols=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], label_cls=MultiCategoryList, one_hot=True) )
3. We added our test set here. ( .add_test(test_datalist) )

Now look at our classifier databunch:

(Note that now we have all the toxicity type labels)

We'll put the databunch into the text_classifier_learner model and load the encoder we learned from the language model. Finally, time to put everything together!

    learn = text_classifier_learner(data_cls, AWD_LSTM, drop_mult=0.5)
    learn.load_encoder('fine_tuned_enc')

Again, find the best learning rate and train one cycle. Then train a few more cycles and unfreeze. See the results: off by one, but overall the prediction is OK.

For reference, I submitted the predictions to Kaggle and got a 0.98098 Public Score (landing in the middle of the Public Leaderboard). The result is not optimal, but as I said, I didn't train all the way due to limited GPU quota. The purpose of this article is to show you the whole process of using fast.ai to tackle a multi-label text classification problem. The real challenge here is loading the data into the model using the Data Block API.

Conclusion

I hope you learned a thing or two from this article. Fast.ai is really a lean, flexible and powerful library. The things it can do (like image/text classification, tabular data, collaborative filtering, etc.), it does very well. It is not as extensive as Keras, but it's very sharp and focused. Kind of like Vim and Emacs, if you are familiar with the command line text editor war. 😜

You can find the Kaggle Kernel here.

Any feedback or constructive criticism is welcome. You can find me on Twitter @lymenlee or on my blog site wayofnumbers.com.
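
For readers who want to reproduce the submission step mentioned in the results, here is a minimal sketch of how a file with one probability per toxicity type per test comment could be assembled. It is not taken from the original kernel. The helper name write_submission, the example ids, and the random numbers standing in for model outputs are all hypothetical, and the column layout (an id column followed by the six label columns) is assumed to match the competition's sample submission. Only the Python standard library is used, so it runs without fast.ai installed.

```python
import csv
import io
import random

# The six label columns, in the same order as the label list used for the classifier.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def write_submission(ids, probs, out):
    """Write a header plus one row per comment: its id followed by six probabilities.

    `write_submission` is a hypothetical helper for illustration only.
    """
    if len(ids) != len(probs):
        raise ValueError("need exactly one probability row per comment id")
    writer = csv.writer(out)
    writer.writerow(["id"] + LABELS)
    for comment_id, row in zip(ids, probs):
        if len(row) != len(LABELS):
            raise ValueError("each row needs one probability per label")
        writer.writerow([comment_id] + [f"{p:.6f}" for p in row])


# Placeholder inputs (assumptions): made-up ids and random numbers standing in
# for the probabilities the trained classifier would produce on test.csv.
rng = random.Random(0)
fake_ids = ["id_a", "id_b", "id_c"]
fake_probs = [[rng.random() for _ in LABELS] for _ in fake_ids]

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_submission(fake_ids, fake_probs, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

In the notebook itself, the placeholder probabilities would be replaced by the classifier's predictions on the test set added with .add_test(test_datalist), and the ids would come from the id column of test.csv.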
