I wanted to talk a little about something I've noticed recently. This is essentially a subtweet aimed at a bunch of people I see on different platforms who seem preoccupied with LLMs (and huge models in general) and appear to claim that they can now solve everything. However, it's important not to forget about "classical" ML/NLP methods, since they are still powerful for many well-established tasks.
What this post will contain: a brief comparison of performance on sentiment analysis – a well-known and well-studied NLP problem – using four methods: two LLM-powered ones and two classical ones (for good measure):
Zero-Shot LLM Classification
Few-Shot LLM Classification
TF-IDF + Support Vector Classifier
TF-IDF + Decision Tree Classifier
What this post will not contain:
A tutorial on sentiment analysis, NLP, or ML in general. If you need a refresher on these topics, I can suggest GeeksforGeeks – it's quite alright.
Scientific rigor of any kind. I'm using a toy dataset, and the implementation is quite hand-wavy and nowhere near optimal. I essentially had a couple of hours of spare time to cook this up, so here we are.
Sentiment analysis is a well-known problem – given some text, we want to mark it as positive, negative, or neutral. I've semi-randomly picked this Twitter sentiment dataset from Kaggle. It has three classes (Negative, Neutral, Positive) and comes with a pre-made split into training (74,000 tweets) and test (1,000 tweets) sets.
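For context, loading the data looks roughly like the sketch below. The file and column names (`twitter_training.csv`, `twitter_validation.csv`, `sentiment`, `text`) are my guesses at a typical Kaggle layout, not necessarily what my notebook uses.

```python
import pandas as pd

# Hypothetical file and column names -- adjust them to the actual Kaggle files.
columns = ["id", "entity", "sentiment", "text"]
train_df = pd.read_csv("twitter_training.csv", names=columns)
test_df = pd.read_csv("twitter_validation.csv", names=columns)

# Keep only the three classes used in this post, in case the raw files contain extras.
classes = ["Negative", "Neutral", "Positive"]
train_df = train_df[train_df["sentiment"].isin(classes)]
test_df = test_df[test_df["sentiment"].isin(classes)]
```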
I will use accuracy, recall (one-vs-rest per class), and precision (one-vs-rest per class) to assess test performance.
I will also record training and inference times, as well as the cost of inference.
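Computing these metrics could look something like the following; the `evaluate` helper is a hypothetical name of mine, but the scikit-learn calls are standard.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

CLASSES = ["Negative", "Neutral", "Positive"]

def evaluate(y_true, y_pred):
    # Overall accuracy plus one-vs-rest recall/precision per class.
    print("Accuracy:", accuracy_score(y_true, y_pred))
    recall = recall_score(y_true, y_pred, labels=CLASSES, average=None, zero_division=0)
    precision = precision_score(y_true, y_pred, labels=CLASSES, average=None, zero_division=0)
    print("Recall:", dict(zip(CLASSES, recall)))
    print("Precision:", dict(zip(CLASSES, precision)))
    # Confusion matrix: rows are true labels, columns are predictions (incl. "Unknown").
    print(pd.crosstab(pd.Series(list(y_true), name="true"),
                      pd.Series(list(y_pred), name="predicted")))
```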
For both LLM methods, I will be using OpenAI's API for the gpt-3.5-turbo-instruct model.
The first and most obvious method to try is zero-shot classification: ask an LLM to generate a response given a prompt of the following form:
Tweet: {tweet_text}
Sentiment:
Then, we just match responses to Positive, Negative, or Neutral (treating anything other than these three as Unknown). For zero- and few-shot classification we don't use the training set at all. So let's see how it performs on our test set.
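In code, that could look roughly like the sketch below. It assumes the post-1.0 `openai` Python client; the `zero_shot_sentiment` helper and the exact matching rule are my own simplification, not necessarily what the notebook does.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def zero_shot_sentiment(tweet_text: str) -> str:
    # Build the prompt exactly as in the template above.
    prompt = f"Tweet: {tweet_text}\nSentiment:"
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=3,
        temperature=0,
    )
    answer = response.choices[0].text.strip().lower()
    # Map the free-text completion back onto our three discrete classes.
    for label in ("Negative", "Neutral", "Positive"):
        if answer.startswith(label.lower()):
            return label
    return "Unknown"
```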
Accuracy: 0.509

| Class | Recall | Precision |
|---|---|---|
| Negative | 0.729323 | 0.557471 |
| Neutral | 0.271335 | 0.635897 |
| Positive | 0.689531 | 0.502632 |
| True \ Predicted | Negative | Neutral | Positive | Unknown |
|---|---|---|---|---|
| Negative | 194 | 31 | 17 | 24 |
| Neutral | 123 | 124 | 172 | 38 |
| Positive | 31 | 40 | 191 | 15 |
Well, it's not perfect. It's alright – there's decent recall for the Negative and Positive cases. However, many cases end up as Unknown. This is one of the major drawbacks: out-of-the-box LLMs can generate pretty much anything, and we have to match the free-text responses to our discrete classes ourselves.
On top of that, processing 1000 test cases took 4 minutes and 12 seconds. The number of processed tokens was ~44,000, which comes to about $0.066 (and that's just for one run – I've executed the whole notebook a few times while experimenting). Let's see if we can improve on that with few-shot classification.
So what are you usually advised to do next?
Improve the prompt and give the model some examples to work with – few-shot learning. So, taking one example of each class from the training set, we'll use the following template.
Tweet: <negative training sample>
Sentiment: Negative
Tweet: <neutral training sample>
Sentiment: Neutral
Tweet: <positive training sample>
Sentiment: Positive
Tweet: {tweet_text}
Sentiment:
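Assembling that prompt is mechanical; here is a small sketch (the `build_few_shot_prompt` helper and the `examples` variable are hypothetical names of mine):

```python
def build_few_shot_prompt(examples, tweet_text):
    # examples: list of (tweet, label) pairs, one per class, drawn from the training set.
    parts = [f"Tweet: {text}\nSentiment: {label}" for text, label in examples]
    parts.append(f"Tweet: {tweet_text}\nSentiment:")
    return "\n\n".join(parts)

# prompt = build_few_shot_prompt(examples, some_test_tweet)
```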
Then we do the same matching procedure. So, let's see the performance on the test set.
Accuracy: 0.574

| Class | Recall | Precision |
|---|---|---|
| Negative | 0.710526 | 0.564179 |
| Neutral | 0.553611 | 0.569820 |
| Positive | 0.476534 | 0.597285 |
| True \ Predicted | Negative | Neutral | Positive | Unknown |
|---|---|---|---|---|
| Negative | 189 | 71 | 6 | 0 |
| Neutral | 121 | 253 | 83 | 0 |
| Positive | 25 | 120 | 132 | 0 |
Well, the performance is somewhat better: the accuracy improved, and we got rid of the Unknown predictions, although recall for the Positive class got worse. Still, it's not too great.
The time and cost, though, are 4 minutes 37 seconds and 161k tokens × $0.0015 per 1k = $0.2415! So, we've still got a subpar sentiment classifier that takes significant time and costs over 20 US cents to classify 1000 tweets. Can we do better? Why yes – let's turn back to some classical ML methods instead.
One of the standard approaches to sentiment analysis is to turn text into vectors somehow (we'll use one of the simplest approaches – TF-IDF vectorization) and feed those vectors to some classification model (we'll use an SVM and a Decision Tree). I figured that taking methods usually covered in intro undergrad ML and NLP courses, rather than something sophisticated, would make my point even clearer.
Let's see what we get with those. First of all, vectorization: TF-IDF takes 1.48 seconds to process both the training and test sets (a total of 75,000 tweets!).
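Here's a minimal sketch of the whole classical pipeline, assuming the `train_df`/`test_df` frames from the loading sketch above and a linear SVM (`LinearSVC`); the notebook's exact settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Vectorize once and reuse the same features for both classifiers.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["text"].astype(str))
X_test = vectorizer.transform(test_df["text"].astype(str))
y_train, y_test = train_df["sentiment"], test_df["sentiment"]

svc = LinearSVC().fit(X_train, y_train)
tree = DecisionTreeClassifier().fit(X_train, y_train)

# Evaluate with the same helper as the LLM runs.
evaluate(y_test, svc.predict(X_test))
evaluate(y_test, tree.predict(X_test))
```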
Training and inference times for the SVC and the Decision Tree are recorded below:
| Method | Training time | Inference time |
|---|---|---|
| SVC | 3.28s | 1.44ms |
| Tree | 23.3s | 2.4ms |
Blazing fast! And that's without using any GPU.
It will even run on a few-years-old laptop without any issues. Let's now have a look at whether we lose anything in terms of performance.
| Method | Accuracy | Recall (Neg/Neu/Pos) | Precision (Neg/Neu/Pos) |
|---|---|---|---|
| SVC | 0.85 | 0.86/0.85/0.83 | 0.84/0.87/0.82 |
| Tree | 0.92 | 0.94/0.91/0.91 | 0.88/0.93/0.92 |
On the contrary, the performance is way better. The Decision Tree method is a touch better on all scores, but it takes more time to train (although it's still under a minute). And it cost me virtually nothing (I ran everything inside a Colab notebook). So, what does this give us?
Performance:
| Method | Accuracy | Recall (Neg/Neu/Pos) | Precision (Neg/Neu/Pos) |
|---|---|---|---|
| Zero-shot LLM | 0.51 | 0.73/0.27/0.69 | 0.56/0.64/0.50 |
| Few-shot LLM | 0.57 | 0.71/0.55/0.47 | 0.56/0.57/0.60 |
| SVC | 0.85 | 0.86/0.85/0.83 | 0.84/0.87/0.82 |
| Tree | 0.92 | 0.94/0.91/0.91 | 0.88/0.93/0.92 |
Times and costs:
| Method | Training time | Inference time | Cost |
|---|---|---|---|
| Zero-shot LLM | n/a | 5min 51s | 44k × $0.0015/1k = $0.066 |
| Few-shot LLM | n/a | 5min 42s | 161k × $0.0015/1k = $0.2415 |
| TF-IDF + SVC | 4.76s | 1.44ms | Practically zero |
| TF-IDF + Tree | 24.8s | 2.4ms | Practically zero |
Well, both classical methods surpass the LLM methods in test performance, time taken, and costs. Huh.
One could say that I didn't tune the LLM methods well enough and that I could improve the outcome with prompt engineering, further tweaking, or even fine-tuning. But that's the point. I've already spent $3.75 on the OpenAI API today. Why spend even more if I can achieve rather good results with simple sklearn-provided models (and I could make them even better by spending time tuning the SVM or the Decision Tree, yay)?
One could say that I could use some open-source LLM from the Hugging Face Hub and host it on my own hardware instead of paying for the API. Sure, but I could easily run the SVM on a Raspberry Pi Zero. In contrast, the smallest transformer model capable of such a task requires quite a bit more compute (a GPU is recommended, and a significant amount of RAM is needed). And that compute still costs money, you know.
One could say that I picked a task that's just not very suitable for LLMs. Exactly – that's the point. You can certainly try throwing an LLM at it, and in some cases it might work, but it will likely not be very efficient. There are other tasks that LLMs are very capable of and where other methods just don't do a good job. So let's use LLMs for those tasks instead.
The main point I want the reader to take away is this: before grabbing an LLM to solve the task at hand, consider whether there's a good, simple, lightweight, and well-researched method for the thing you want to do. Maybe there indeed is.
This post is directed at nobody in particular but is inspired by the frequent misuse of technology by people on the Internet. The code for this set of experiments is available in the Colab Notebook (provide your own dataset files and OpenAI API key).
Lead image by Krzysztof Niewolny / Unsplash