
Evaluating Sentiment Analysis Performance: LLMs vs Classical ML

by D. Lowl, February 1st, 2024

Too Long; Didn't Read

Tested sentiment analysis on Twitter data. LLMs took minutes and dollars per 1K tweets, classical ML took milliseconds and cents. SVM and decision trees beat LLMs on accuracy and metrics while being 100x faster and cheaper. Don't assume giant AI is the best tool - check if simpler methods work better first.


I wanted to talk a little about one thing I've noticed recently. This is essentially subtweeting a bunch of people I see on different platforms who seem to be preoccupied with LLMs (and huge models in general) and appear to claim that they can now solve everything. However, it's important not to forget about the "classical" ML/NLP methods, since they are still powerful for many well-established tasks.


What this post will contain: a brief comparison of performance on sentiment analysis – a well-known and well-studied NLP problem – using four methods: two LLM-powered ones and two classical ones (for good measure):


  • Zero-Shot LLM Classification

  • Few-Shot LLM Classification

  • TF-IDF + Support Vector Classifier

  • TF-IDF + Decision Tree Classifier


What this post will not contain:

  • A tutorial on sentiment analysis, NLP, or ML in general. If you need a refresher on these topics, I can suggest GeeksforGeeks; it's quite alright.

  • It will not be scientifically rigorous in the slightest. I'm using a toy dataset, and the implementation is quite hand-wavy and nowhere near optimal. I essentially had a couple of hours of spare time to cook this up, so here we are.



Problem setting

Sentiment Analysis is a well-known problem – given some text, we want to mark it as positive, negative, or neutral. I've semi-randomly picked this Twitter sentiment dataset from Kaggle. It has three classes (Negative, Neutral, Positive) and comes pre-split into training (74000 tweets) and test (1000 tweets) sets.
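
For concreteness, a minimal loading sketch – the file and column names here are placeholders, not the dataset's actual names, so adjust them to whatever the downloaded CSVs contain:

import pandas as pd

# Placeholder file and column names -- rename to match your Kaggle download
train_df = pd.read_csv("twitter_train.csv")   # expected columns: "text", "sentiment"
test_df = pd.read_csv("twitter_test.csv")

# Keep only the three classes used in this post
labels = ["Negative", "Neutral", "Positive"]
train_df = train_df[train_df["sentiment"].isin(labels)]
test_df = test_df[test_df["sentiment"].isin(labels)]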


I will use accuracy, recall (one-vs-rest per class), and precision (one-vs-rest per class) to assess test performance.


I will also record the time of training and inference, as well as the cost of inference.

For both LLM methods, I will be using OpenAI's API for the gpt-3.5-turbo-instruct model.
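
All four methods are scored the same way; roughly, the evaluation helper looks like this (a scikit-learn sketch, not necessarily the exact code in the notebook):

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

LABELS = ["Negative", "Neutral", "Positive"]

def evaluate(y_true, y_pred):
    # Overall accuracy plus one-vs-rest recall and precision per class
    print("Accuracy:", accuracy_score(y_true, y_pred))
    recall = recall_score(y_true, y_pred, labels=LABELS, average=None, zero_division=0)
    precision = precision_score(y_true, y_pred, labels=LABELS, average=None, zero_division=0)
    print("Recall", pd.Series(recall, index=LABELS), sep="\n")
    print("Precision", pd.Series(precision, index=LABELS), sep="\n")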


Experiments

Zero-shot LLM Classification

The first and obvious method to try is zero-shot classification: ask an LLM to generate a response, given a prompt of the following form:

Tweet: {tweet_text}
Sentiment:


Then, we just match responses to Positive, Negative, or Neutral (treating anything other than these three as Unknown). For zero-shot and few-shot, we are not using the training set. So let's see how it performs on our test set.
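
Roughly, the classification loop looks like this (a sketch using the openai Python SDK, v1 style; max_tokens and temperature here are just reasonable defaults, not necessarily what the notebook uses):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["Negative", "Neutral", "Positive"]

def classify_zero_shot(tweet_text: str) -> str:
    prompt = f"Tweet: {tweet_text}\nSentiment:"
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=5,
        temperature=0,
    )
    answer = response.choices[0].text.strip()
    # Match the free-text completion back to one of our discrete classes
    for label in LABELS:
        if answer.lower().startswith(label.lower()):
            return label
    return "Unknown"

predictions = [classify_zero_shot(t) for t in test_df["text"]]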


Accuracy: 0.509

Recall

Negative    0.729323
Neutral     0.271335
Positive    0.689531
dtype: float64

Precision

Negative    0.557471
Neutral     0.635897
Positive    0.502632
dtype: float64


Confusion matrix (rows: actual class, columns: predicted class):

Actual      Negative   Neutral  Positive   Unknown
Negative         194        31        17        24
Neutral          123       124       172        38
Positive          31        40       191        15
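
(A confusion table like this is a one-line cross-tabulation in pandas, given the true labels and the predictions from the sketches above.)

import pandas as pd

confusion = pd.crosstab(
    pd.Series(test_df["sentiment"].values, name="Actual"),
    pd.Series(predictions, name="Predicted"),
)
print(confusion)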


Well, it's not perfect, but it's alright: there's decent recall for the Negative and Positive classes. However, many cases end up as Unknown. This is one of the major drawbacks: out-of-the-box LLMs can generate pretty much anything, and we have to match the free-text responses back to our discrete classes ourselves.


On top of that, processing 1000 test cases took 4 minutes and 12 seconds. The number of processed tokens is ~44000, which results in about $0.066 (and that's just for one run; I've executed the whole notebook a few times while experimenting). Let's see if we can improve on that with few-shot classification.
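
The cost arithmetic itself is trivial – the token count can be accumulated from the usage field each API response reports (response.usage.total_tokens in the v1 SDK), and the flat $0.0015 per 1k tokens used throughout this post is a simplification, since OpenAI prices prompt and completion tokens separately and prices change over time:

PRICE_PER_1K_TOKENS = 0.0015  # simplifying assumption, see note above

def estimated_cost(total_tokens: int) -> float:
    return total_tokens * PRICE_PER_1K_TOKENS / 1000

print(estimated_cost(44_000))  # ~0.066, matching the zero-shot run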


Few-shot LLM Classification

What's usually suggested as the next step, then?


Improve the prompt and give the model some examples to work with – few-shot learning. So, taking one example of each class from the training set, we'll use the following template:


Tweet: <negative training sample>
Sentiment: Negative
Tweet: <neutral training sample>
Sentiment: Neutral
Tweet: <positive training sample>
Sentiment: Positive
Tweet: {tweet_text}
Sentiment:
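
Here's roughly how such a prompt can be assembled (I'm just taking the first training example of each class here for illustration; column names follow the loading sketch above):

def build_few_shot_prompt(tweet_text: str) -> str:
    parts = []
    for label in ["Negative", "Neutral", "Positive"]:
        # Naive choice: the first training tweet with this label
        example = train_df[train_df["sentiment"] == label]["text"].iloc[0]
        parts.append(f"Tweet: {example}\nSentiment: {label}")
    parts.append(f"Tweet: {tweet_text}\nSentiment:")
    return "\n".join(parts)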


Then we do the same matching procedure. So, let's see the performance on the test set:

Accuracy: 0.574

Recall

Negative    0.710526
Neutral     0.553611
Positive    0.476534
dtype: float64

Precision

Negative    0.564179
Neutral     0.569820
Positive    0.597285
dtype: float64


Confusion matrix (rows: actual class, columns: predicted class):

Actual      Negative   Neutral  Positive   Unknown
Negative         189        71         6         0
Neutral          121       253        83         0
Positive          25       120       132         0


Well, the performance is somewhat better. The accuracy improved, although the recall for the Positive class got worse. We got rid of the Unknown predictions, but it's still not great.

The time and cost, though, are 4 minutes 37 seconds and 161k tokens × $0.0015 per 1k = $0.2415! So, we've still got a subpar sentiment classifier that takes significant time and costs over 20 US cents to classify 1000 tweets. Can we do better? Why yes – let's turn back to some classical ML methods instead.


Classical ML Methods for NLP

One of the standard approaches to sentiment analysis is to turn the text into vectors somehow (we'll use one of the simplest approaches – TF-IDF vectorization) and feed them to some classification model (we'll use an SVM and a Decision Tree). I thought that taking methods that are usually discussed in intro undergrad ML and NLP courses, rather than something sophisticated, would make my point even clearer.


Let's see what we get with those. First of all, vectorization: TF-IDF takes 1.48 seconds to process both the training and test sets (a total of 75000 tweets!).
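
The vectorization step is a couple of lines with scikit-learn (column names follow the loading sketch above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["text"])  # fit on training texts only
X_test = vectorizer.transform(test_df["text"])
y_train = train_df["sentiment"]
y_test = test_df["sentiment"]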

Training and inference times for the SVC and the Decision Tree are recorded below:

Method    Training time    Inference time
SVC       3.28s            1.44ms
Tree      23.3s            2.4ms
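
Roughly, the two models are trained like this (I'm showing LinearSVC for the support vector classifier and default hyperparameters for both – a sketch rather than the exact notebook code):

from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

svc = LinearSVC()
svc.fit(X_train, y_train)
svc_preds = svc.predict(X_test)

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)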


Blazing fast! And that's without using any GPU.


It will even run on a few-years-old laptop without any issues. Let's now have a look at whether we lose anything in terms of performance.

Method           Accuracy    Recall (Neg/Neu/Pos)    Precision (Neg/Neu/Pos)
SVC              0.85        0.86/0.85/0.83          0.84/0.87/0.82
Tree             0.92        0.94/0.91/0.91          0.88/0.93/0.92


On the contrary, the performance is way better. The Decision Tree is a touch better than the SVC on all scores, but it takes more time to train (albeit still under a minute). And it cost me virtually nothing (I ran everything inside a Colab notebook). So, what does this give us?


Results

Performance:

Method           Accuracy    Recall (Neg/Neu/Pos)    Precision (Neg/Neu/Pos)
Zero-shot LLM    0.51        0.73/0.27/0.69          0.56/0.64/0.50
Few-shot LLM     0.57        0.71/0.55/0.47          0.56/0.57/0.60
SVC              0.85        0.86/0.85/0.83          0.84/0.87/0.82
Tree             0.92        0.94/0.91/0.91          0.88/0.93/0.92

Times and costs:

Method           Training time    Inference time    Cost
Zero-shot LLM    n/a              5min 51s          44k tokens × $0.0015/1k = $0.066
Few-shot LLM     n/a              5min 42s          161k tokens × $0.0015/1k = $0.2415
TF-IDF + SVC     4.76s            1.44ms            Practically zero
TF-IDF + Tree    24.8s            2.4ms             Practically zero

Well, both classical methods surpass the LLM methods in test performance, time taken, and costs. Huh.


Discussion

One could say that I didn't tune the LLM methods well enough and that I could improve the outcome with prompt engineering, further tweaking, or even fine-tuning. But that's the point. I've already spent $3.75 on the OpenAI API today. Why spend even more if I can achieve rather good results with simple sklearn-provided models (and I could make them even better by spending time tweaking the SVM or the Decision Tree, yay)?


One could say that I could use some open-source LLM from the Hugging Face Hub and host it on my own hardware instead of paying for the API. Sure, but I could easily run the SVM on a Raspberry Pi Zero. In contrast, the smallest transformer model capable of such a task requires quite a bit more compute (a GPU is recommended, and a significant amount of RAM is needed). And that compute still costs money, you know.


One could say that I picked a task that's just not very suitable for LLMs. Exactly – that's the point. You can certainly try it, and in some cases it might work, but it will likely not be very efficient. There are other tasks that LLMs are very capable of and where other methods just don't do a good job; let's use LLMs for those tasks instead.


The main point I want the reader to take away is this: before grabbing an LLM to solve the new task at hand, consider whether there's a good, simple, lightweight, and well-researched method for the thing you want to do. Maybe there indeed is.


This post is directed at nobody specifically but is inspired by the frequent misuse of technology by people on the Internet. The code for this set of experiments is available in the Colab Notebook (provide your dataset files and OpenAI API keys).


Also published here.

Lead image by Krzysztof Niewolny / Unsplash