I’ve overcome my skepticism about fast.ai for production and used ULMFiT to train a text classification system for a non-English language, with a small dataset and lots of classes.
About the project
My friend and classmate, who is one of the founders of RocketBank (a leading online-only bank in Russia), asked me to develop a classifier to help the first line of customer support.
The initial set of constraints was pretty restrictive:
- no labeled historical data
- obfuscated personally identifiable information
- “we want it yesterday”
- mostly Russian language (hard to find a pre-trained model)
- no access to cloud solutions due to privacy regulations
- ability to retrain the model without my involvement if new classes arise
The scope of work was pretty straightforward: develop a model and a serving solution for classifying incoming messages into 25 classes.
Initially, after thinking about these restrictions, I was pretty sure that no neural networks should be used for this task. Why?
- We would never label enough data for a neural model in a reasonable time.
- Building an environment for reliably serving a neural model is kind of a pain.
- I was skeptical about reaching the requested performance (requests per second) with reasonable resources.
Based on that, I made a sad face and created a new conda environment. I went classic.
Dataset collection and initial research.
RocketBank set up a task force consisting of a project manager and a DevOps engineer on their side, plus a bunch of people handling dataset labeling. That was extremely smart and helpful: in my opinion, this is a perfect team for handling a data science project in the industrial world.
We started with analyzing historical data and came to a number of conclusions:
- To train a system we take into account only messages received by the bank before any response from customer support.
- There are 2 distinct meta-classes of incoming messages: those coming from existing customers and those coming from new leads. Feeding this meta-class to the classifier as an extra input should give it additional signal and boost classification scores.
- The bank, on its side, decided on 25 distinct classes of messages ranging from ‘Credit request’ to ‘Harassment’.
I requested around 25,000 unlabeled historical messages, and in just a few days the task force was able to classify around 1,500 of them into 25 classes. Initially, I assumed that this number (1.5k) was too low to even try any neural model (I was wrong).
TF-IDF + TPOT and DEAP
I will fast-forward through the non-interesting part. I decided to test various flavors of TF-IDF and embeddings and to optimize the machine learning model using TPOT and DEAP.
TPOT and DEAP, for those unaware, are two secret weapons in a data scientist’s arsenal: they make model search hands-free, at the cost of a lot of CPU time.
TPOT runs the whole stack of machine learning methods available in sklearn plus a few extras (XGBoost) and searches for the best pipeline. I played around with various embeddings, fed them into TPOT, and after 24 hours it reported that Random Forest performs best for my data (ha-ha, what a surprise!).
Then I needed to find an optimal set of hyperparameters. I always do this using directed evolution strategies with the DEAP library. That actually deserves a separate post.
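To give a flavor of what an evolutionary hyperparameter search looks like, here is a minimal sketch in plain Python. It is not DEAP's API and not the project code: the `evolve` function, the toy `toy_fitness`, and every parameter value are made up for illustration. In a real search, the fitness function would be something like a cross-validated score of the model, and DEAP would supply the selection and mutation building blocks.

```python
import random


def evolve(fitness, bounds, pop_size=20, generations=30, sigma=0.1, seed=0):
    """Keep the better half of the population and refill it with mutated copies.

    `bounds` is a list of (low, high) pairs, one per hyperparameter.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    # Start from uniformly random candidates inside the bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged (elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        # Mutation: each parent produces one child with Gaussian noise,
        # clipped back into the allowed range.
        children = []
        for parent in parents:
            child = []
            for gene, (lo, hi) in zip(parent, bounds):
                mutated = gene + rng.gauss(0.0, sigma * (hi - lo))
                child.append(min(hi, max(lo, mutated)))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(params):
    # Toy stand-in for a validation score; its maximum is at (0.3, 0.7).
    x, y = params
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)


best = evolve(toy_fitness, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best, toy_fitness(best))
```

Because the surviving parents are carried over unchanged, the best score never gets worse from one generation to the next.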
Anyway, at the end of the day, I had an optimal set of settings and my precision was around 63%. I think this was close to the maximum I could get from classical methods and a 1.5k dataset. While 63% for 25 classes sounds good from the machine learning perspective, it’s quite bad for real-world usage. So, I decided to take a look at neural nets as a last resort.
Fast.ai comes into play.
So, I needed a fast way to check the performance of a neural-based model on the same task. While implementing a model from scratch using TensorFlow was the most viable option, I decided to run a fast test with fast.ai and their recent discovery, ULMFiT. The problem was that I needed a pretrained language model for Russian text, which wasn’t available in fast.ai. After looking at the fast.ai forums I discovered an ongoing effort to create a set of language models for most languages. There was a thread for the Russian language and a pre-trained model from a Russian Kaggler, Pavel Pleskov, which he had used to get second place at a Yandex competition.
From there it was mostly writing 20 lines of code and a few hours of GPU training time to get to 70% precision. After a few more days of tuning hyperparameters, I got to 80% precision. Some tips:
- Use FocalLoss as the training objective.
- Start from a pretrained language model, but fine-tune it on whatever unlabeled data is available.
- Convert text to lowercase and add a token that marks uppercase; add a special token for pieces of obfuscated data.
- Put the meta-class token not only at the beginning but also at the end of the message.
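On the first tip: focal loss scales the usual cross-entropy by (1 - p_t)^gamma, where p_t is the predicted probability of the true class, so examples the model already classifies confidently contribute less to training. Below is a minimal single-example sketch in plain Python. The value gamma = 2 is an assumption (a common default; the value used in the project is not stated), and the function names are illustrative rather than any library's API.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    gamma=2.0 is an assumed default, not a value taken from the project.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    # With gamma = 0 the modulating factor is 1 and focal loss
    # reduces to plain cross-entropy.
    return focal_loss(logits, target, gamma=0.0)


easy = [5.0, 0.0, 0.0]  # true class 0 is predicted confidently
hard = [0.0, 2.0, 0.0]  # true class 0 gets a low probability
for name, logits in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

The easy example's loss shrinks by a much larger factor than the hard one's, which is the point when a few of the 25 classes dominate the data.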
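The last two tips amount to a small preprocessing step, which can be sketched as follows. The token names (`xxup`, `xxmask`, `xxcustomer`, `xxlead`) and the assumption that obfuscated data shows up as runs of asterisks are illustrative guesses, not the project's actual conventions.

```python
import re

# Hypothetical special tokens; the real project's token names are not known.
UPPER_TOKEN = "xxup"
MASK_TOKEN = "xxmask"
META_TOKENS = {"customer": "xxcustomer", "lead": "xxlead"}

# Assumption: obfuscated personal data appears as two or more asterisks.
MASK_PATTERN = re.compile(r"\*{2,}")


def preprocess(message, meta_class):
    """Lowercase the text, mark all-caps words and masked spans,
    and wrap the message in the meta-class token on both sides."""
    meta = META_TOKENS[meta_class]
    text = MASK_PATTERN.sub(f" {MASK_TOKEN} ", message)
    out = [meta]
    for word in text.split():
        if word == MASK_TOKEN:
            out.append(word)
        elif len(word) > 1 and word.isupper():
            # Keep the "shouting" signal as a separate token.
            out.extend([UPPER_TOKEN, word.lower()])
        else:
            out.append(word.lower())
    out.append(meta)
    return " ".join(out)


print(preprocess("Card **** BLOCKED please", "customer"))
# xxcustomer card xxmask xxup blocked please xxcustomer
```

`str.isupper` and `str.lower` handle Cyrillic as well, so the same function works on Russian messages.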
OK, great. Should I convert the model into TensorFlow? Nope, I was lazy and decided to test model performance using native fast.ai + PyTorch + Docker.
After running stress-tests in a single-core Docker container I was surprised to see less than 300 milliseconds response time for an average request and no crashes. What else did I need? Nothing.
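A latency check of this kind can be sketched in a few lines. This is a minimal single-process timing loop, not the stress-test setup used in the project: `measure_latency` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for a call to the real classifier.

```python
import statistics
import time


def measure_latency(predict, messages, warmup=3):
    """Time `predict` on each message and return summary stats in milliseconds."""
    # Warm-up calls are not timed, so one-off setup costs do not skew the stats.
    for msg in messages[:warmup]:
        predict(msg)
    timings_ms = []
    for msg in messages:
        start = time.perf_counter()
        predict(msg)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "n": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "max_ms": max(timings_ms),
    }


def dummy_predict(text):
    # Hypothetical stand-in for the model: returns a fake class id in 0..24.
    return len(text) % 25


stats = measure_latency(dummy_predict, ["where is my card?", "I need a loan"] * 50)
print(stats)
```

Running the same loop inside the target container with the real model gives a first estimate before setting up a proper load test with concurrent requests.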
Fast.ai showed that it is a perfect solution for fast and precise development of production ML systems.
The beauty of fast.ai + transfer learning is that the result of retraining is pretty predictable in terms of quality and speed. I’ve shipped a script inside the Docker container that replicates my final training notebook and produces a new model as an asset. I’ve run a few cycles of retraining and cross-validation and obtained highly repeatable results, so this is a simple way to deliver not only a model but a training script as well.
I can’t share the actual code and system configs, but I am ready to answer any questions.