I wanted to talk a little about one thing I've noticed recently. This is essentially me subtweeting a bunch of people I see on different platforms who seem to be preoccupied with LLMs (and huge models in general) and appear to claim that they can now solve everything. However, it's important not to forget about the "classical" ML/NLP methods, since they are still powerful for many well-established tasks.

So here is a brief comparison of performance on sentiment analysis – a well-known and well-studied NLP problem – using four methods: two LLM-powered ones and two classical ones (for good measure).

What this post will contain:

- Zero-Shot LLM Classification
- Few-Shot LLM Classification
- TF-IDF + Support Vector Classifier
- TF-IDF + Decision Tree Classifier

What this post will not contain:

- A tutorial on sentiment analysis, NLP, or ML in general. If you need a refresher on these topics, I can suggest GeeksforGeeks – it's quite alright.
- Scientific rigour – not in the slightest. I'm using a toy dataset, and the implementation is quite hand-wavy and nowhere near optimal. I essentially had a couple of hours of spare time to cook this up, so here we are.

## Problem setting

Sentiment analysis is a well-known problem: given some text, we want to mark it as positive, negative, or neutral. I've semi-randomly picked this Twitter sentiment dataset from Kaggle. It has three classes (Negative, Neutral, Positive) and comes pre-split into a training set (74,000 tweets) and a test set (1,000 tweets).

I will use accuracy, recall (one-vs-rest per class), and precision (one-vs-rest per class) to assess test performance. I will also record training and inference times, as well as the cost of inference.

For both LLM methods, I will be using OpenAI's API with the gpt-3.5-turbo-instruct model.

## Experiments

### Zero-shot LLM Classification

The first and obvious method to try is zero-shot classification: ask an LLM to generate a response, given a prompt of the following form:

```
Tweet: {tweet_text}
Sentiment:
```

Then, we just match responses to Positive, Negative, or Neutral (treating anything other than these three as Unknown). Note that the zero-shot method doesn't use the training set at all (and few-shot will only borrow a handful of examples from it).

So let's see how it performs on our test set.

```
Accuracy: 0.509

Recall
Negative    0.729323
Neutral     0.271335
Positive    0.689531
dtype: float64

Precision
Negative    0.557471
Neutral     0.635897
Positive    0.502632
dtype: float64
```

The confusion matrix (rows are actual labels, columns are predicted):

|          | Negative | Neutral | Positive | Unknown |
|----------|----------|---------|----------|---------|
| Negative | 194      | 31      | 17       | 24      |
| Neutral  | 123      | 124     | 172      | 38      |
| Positive | 31       | 40      | 191      | 15      |

Well, it's not perfect, but it's alright: there's decent recall for the Negative and Positive classes. However, many cases end up as Unknown. This is one of the major drawbacks of the approach: out-of-the-box LLMs generate free-text and can output pretty much anything, so we have to match their responses back to our discrete classes.

On top of that, processing 1,000 test cases took 4 minutes and 12 seconds. The number of processed tokens is ~44,000, which results in about $0.066 (and that's just for one run – I've executed the whole notebook a few times while experimenting).
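For reference, here's roughly what the zero-shot setup looks like in code. This is a minimal sketch rather than the exact notebook code: it assumes the openai Python client (v1 interface) with an OPENAI_API_KEY in the environment, and map_to_label is my illustrative name for the response-matching step described above.

```python
# Minimal zero-shot sketch (assumes openai>=1.0 and OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
LABELS = ("Negative", "Neutral", "Positive")

def map_to_label(completion: str) -> str:
    """Match the model's free-text output to one of our discrete classes."""
    cleaned = completion.strip().lower()
    for label in LABELS:
        if cleaned.startswith(label.lower()):
            return label
    return "Unknown"

def zero_shot_sentiment(tweet_text: str) -> str:
    prompt = f"Tweet: {tweet_text}\nSentiment:"
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # legacy completions endpoint
        prompt=prompt,
        max_tokens=3,
        temperature=0,
    )
    return map_to_label(response.choices[0].text)

# predictions = [zero_shot_sentiment(t) for t in test_tweets]
```

The few-shot variant in the next section uses exactly the same call – only the prompt changes.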
Let's see if we can improve on that with few-shot classification.

### Few-shot LLM Classification

What's usually suggested next? Improve the prompt and give the model some examples to work with – few-shot learning. So, taking one example of each class from the training set, we'll use the following template:

```
Tweet: <negative training sample>
Sentiment: Negative

Tweet: <neutral training sample>
Sentiment: Neutral

Tweet: <positive training sample>
Sentiment: Positive

Tweet: {tweet_text}
Sentiment:
```

Then we do the same matching procedure. So, let's see the performance on the test set.

```
Accuracy: 0.574

Recall
Negative    0.710526
Neutral     0.553611
Positive    0.476534
dtype: float64

Precision
Negative    0.564179
Neutral     0.569820
Positive    0.597285
dtype: float64
```

The confusion matrix (rows are actual labels, columns are predicted):

|          | Negative | Neutral | Positive | Unknown |
|----------|----------|---------|----------|---------|
| Negative | 189      | 71      | 6        | 0       |
| Neutral  | 121      | 253     | 83       | 0       |
| Positive | 25       | 120     | 132      | 0       |

Well, the performance is somewhat better: the accuracy improved, and we got rid of the Unknown predictions, although the recall for the Positive class is worse. It's still not too great.

The time and cost, though, come to 4 minutes 37 seconds and 161k tokens × $0.0015 per 1k = $0.2415! So, we've still got a subpar sentiment classifier that takes a significant amount of time and costs over 20 US cents to classify 1,000 tweets. Can we do better? Why yes – let's turn back to some classical ML methods instead.

### Classical ML Methods for NLP

One of the standard approaches to sentiment analysis is to turn the text into vectors somehow (we'll use one of the simplest options – TF-IDF vectorization) and feed those vectors to some classification model (we'll use an SVM and a Decision Tree). I thought that taking the methods that are usually discussed in intro ML and NLP undergrad courses, rather than something sophisticated, would make my point even clearer. Let's see what we get with those.

First of all, vectorization. TF-IDF takes 1.48 seconds to process both the training and test sets (a total of 75,000 tweets!). Training and inference times for the SVC and the Decision Tree are recorded below.

| Method | Training time | Inference time |
|--------|---------------|----------------|
| SVC    | 3.28s         | 1.44ms         |
| Tree   | 23.3s         | 2.4ms          |

Blazing fast! And that's without using any GPU – it will even run on a few-years-old laptop without any issues. Let's now have a look at whether we lose anything in terms of performance.

| Method | Accuracy | Recall (Neg/Neu/Pos) | Precision (Neg/Neu/Pos) |
|--------|----------|----------------------|-------------------------|
| SVC    | 0.85     | 0.86/0.85/0.83       | 0.84/0.87/0.82          |
| Tree   | 0.92     | 0.94/0.91/0.91       | 0.88/0.93/0.92          |

On the contrary, the performance is way better. The Decision Tree is a touch better than the SVC on all scores, but it takes more time to train (albeit still under a minute). And it cost me virtually nothing (I ran everything inside a Colab notebook).
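For reference, the whole classical pipeline fits in a few lines of scikit-learn. This is a minimal sketch under some assumptions of mine: the tweets and labels are already loaded into train_texts, train_labels, test_texts, and test_labels (my names, not the notebook's), LinearSVC is my guess for the SVM variant, and all hyperparameters are left at their defaults.

```python
# Minimal sketch of the TF-IDF + classical classifier pipeline (scikit-learn).
# Assumes train_texts/train_labels/test_texts/test_labels are already loaded.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Fit TF-IDF on the training tweets and reuse the same vocabulary for the test tweets.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train and evaluate both classifiers with default hyperparameters.
for name, model in [("SVC", LinearSVC()), ("Tree", DecisionTreeClassifier())]:
    model.fit(X_train, train_labels)
    preds = model.predict(X_test)
    print(name, accuracy_score(test_labels, preds))
    print("  recall   :", recall_score(test_labels, preds, average=None))     # per class (one-vs-rest)
    print("  precision:", precision_score(test_labels, preds, average=None))  # per class (one-vs-rest)
```

Defaults are the whole point here: as discussed below, there's likely still headroom if you bother tuning either model.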
So, what does it give us?

## Results

Performance:

| Method        | Accuracy | Recall (Neg/Neu/Pos) | Precision (Neg/Neu/Pos) |
|---------------|----------|----------------------|-------------------------|
| Zero-shot LLM | 0.51     | 0.73/0.27/0.69       | 0.56/0.64/0.50          |
| Few-shot LLM  | 0.57     | 0.71/0.55/0.47       | 0.56/0.57/0.60          |
| SVC           | 0.85     | 0.86/0.85/0.83       | 0.84/0.87/0.82          |
| Tree          | 0.92     | 0.94/0.91/0.91       | 0.88/0.93/0.92          |

Times and costs (training times for the classical methods include the 1.48s spent fitting TF-IDF):

| Method        | Training time | Inference time | Cost                     |
|---------------|---------------|----------------|--------------------------|
| Zero-shot LLM | n/a           | 5min 51s       | 44k × $0.0015 = $0.066   |
| Few-shot LLM  | n/a           | 5min 42s       | 161k × $0.0015 = $0.2415 |
| TF-IDF + SVC  | 4.76s         | 1.44ms         | Practically zero         |
| TF-IDF + Tree | 24.8s         | 2.4ms          | Practically zero         |

Well, both classical methods surpass the LLM-based ones in test performance, time taken, and cost. Huh.

## Discussion

One could say that I didn't tune the LLM methods well enough and that I could improve the outcome with prompt engineering, further tweaking, or even fine-tuning. But that's the point: I've already spent $3.75 on the OpenAI API today. Why spend even more, if I can achieve rather good results with simple sklearn-provided models (and I could make them even better by spending time tweaking the SVM or the Decision Tree, yay)?

One could say that I could use some open-source LLM from the Hugging Face Hub and host it on my own hardware instead of paying for the API. Sure, but I could run the SVM on a Raspberry Pi Zero easily. In contrast, the smallest transformer model capable of such a task will require quite a bit more compute (a GPU is recommended, and a significant amount of RAM is needed). And that computing power still costs money, you know.

One could say that I picked a task that's just not very suitable for LLMs. Exactly – that's the point. You can certainly try throwing an LLM at it, and in some cases it might work, but it will likely not be very efficient. There are other tasks that LLMs are very capable of and that other methods just don't do a good job at – so let's use LLMs for those tasks.

The main point I want the reader to take away is this: before grabbing an LLM to solve the task at hand, consider whether there's a good, simple, lightweight, and well-researched method for the thing you want to do. Maybe there indeed is.

This post is directed at nobody specifically but is inspired by the frequent misuse of technology by people on the Internet.

The code for this set of experiments is available in the Colab Notebook (provide your own dataset files and OpenAI API keys).

Also published here.

Lead image by Krzysztof Niewolny / Unsplash