I wanted to talk a little about one thing I've noticed recently. This is essentially me subtweeting a bunch of people I see on different platforms who seem to be preoccupied with LLMs (and huge models in general) and appear to claim that they can now solve everything. However, it's important not to forget about the "classical" ML/NLP methods, since they are still powerful for many well-established tasks.

So here is a brief comparison of performance on sentiment analysis – a well-known and well-studied NLP problem – using four methods: two LLM-powered ones and two classical ones (for good measure).

What this post will contain:

- Zero-Shot LLM Classification
- Few-Shot LLM Classification
- TF-IDF + Support Vector Classifier
- TF-IDF + Decision Tree Classifier

What this post will not contain:

- A tutorial on sentiment analysis, NLP, or ML in general. If you need a refresher on these topics, I can suggest GeeksforGeeks – it's quite alright.
- Scientific rigour – not in the slightest. I'm using a toy dataset, and the implementation is quite hand-wavy and nowhere near optimal. I essentially had a couple of hours of spare time to cook this up, so here we are.

## Problem setting

Sentiment analysis is a well-known problem: given some text, we want to mark it as positive, negative, or neutral. I've semi-randomly picked this Twitter sentiment dataset from Kaggle. It has three classes (Negative, Neutral, Positive) and comes pre-split into a training set (74,000 tweets) and a test set (1,000 tweets).

I will use accuracy, recall (one-vs-rest per class), and precision (one-vs-rest per class) to assess test performance. I will also record training and inference times, as well as the cost of inference.

For both LLM methods, I will be using OpenAI's API with the gpt-3.5-turbo-instruct model.

## Experiments

### Zero-shot LLM Classification

The first and obvious method to try is zero-shot classification: ask an LLM to generate a response, given a prompt of the following form:

```
Tweet: {tweet_text}
Sentiment:
```

Then, we just match responses to Positive, Negative, or Neutral (treating anything other than these three as Unknown). Note that the zero-shot method doesn't use the training set at all (and few-shot will only borrow a handful of examples from it).

So let's see how it performs on our test set.

```
Accuracy: 0.509

Recall
Negative    0.729323
Neutral     0.271335
Positive    0.689531
dtype: float64

Precision
Negative    0.557471
Neutral     0.635897
Positive    0.502632
dtype: float64
```

The confusion matrix (rows are actual labels, columns are predicted):

|          | Negative | Neutral | Positive | Unknown |
|----------|----------|---------|----------|---------|
| Negative | 194      | 31      | 17       | 24      |
| Neutral  | 123      | 124     | 172      | 38      |
| Positive | 31       | 40      | 191      | 15      |

Well, it's not perfect, but it's alright: there's decent recall for the Negative and Positive classes. However, many cases end up as Unknown. This is one of the major drawbacks of the approach: out-of-the-box LLMs generate free-text and can output pretty much anything, so we have to match their responses back to our discrete classes.

On top of that, processing 1,000 test cases took 4 minutes and 12 seconds. The number of processed tokens is ~44,000, which results in about $0.066 (and that's just for one run – I've executed the whole notebook a few times while experimenting).
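For reference, here's roughly what the zero-shot setup looks like in code. This is a minimal sketch rather than the exact notebook code: it assumes the openai Python client (v1 interface) with an OPENAI_API_KEY in the environment, and map_to_label is my illustrative name for the response-matching step described above.

```python
# Minimal zero-shot sketch (assumes openai>=1.0 and OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
LABELS = ("Negative", "Neutral", "Positive")

def map_to_label(completion: str) -> str:
    """Match the model's free-text output to one of our discrete classes."""
    cleaned = completion.strip().lower()
    for label in LABELS:
        if cleaned.startswith(label.lower()):
            return label
    return "Unknown"

def zero_shot_sentiment(tweet_text: str) -> str:
    prompt = f"Tweet: {tweet_text}\nSentiment:"
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # legacy completions endpoint
        prompt=prompt,
        max_tokens=3,
        temperature=0,
    )
    return map_to_label(response.choices[0].text)

# predictions = [zero_shot_sentiment(t) for t in test_tweets]
```

The few-shot variant in the next section uses exactly the same call – only the prompt changes.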
Let's see if we can improve on that with few-shot classification.

### Few-shot LLM Classification

What's usually suggested next? Improve the prompt and give the model some examples to work with – few-shot learning. So, taking one example of each class from the training set, we'll use the following template:

```
Tweet: <negative training sample>
Sentiment: Negative

Tweet: <neutral training sample>
Sentiment: Neutral

Tweet: <positive training sample>
Sentiment: Positive

Tweet: {tweet_text}
Sentiment:
```

Then we do the same matching procedure. So, let's see the performance on the test set.

```
Accuracy: 0.574

Recall
Negative    0.710526
Neutral     0.553611
Positive    0.476534
dtype: float64

Precision
Negative    0.564179
Neutral     0.569820
Positive    0.597285
dtype: float64
```

The confusion matrix (rows are actual labels, columns are predicted):

|          | Negative | Neutral | Positive | Unknown |
|----------|----------|---------|----------|---------|
| Negative | 189      | 71      | 6        | 0       |
| Neutral  | 121      | 253     | 83       | 0       |
| Positive | 25       | 120     | 132      | 0       |

Well, the performance is somewhat better: the accuracy improved, and we got rid of the Unknown predictions, although the recall for the Positive class is worse. It's still not too great.

The time and cost, though, come to 4 minutes 37 seconds and 161k tokens × $0.0015 per 1k = $0.2415! So, we've still got a subpar sentiment classifier that takes a significant amount of time and costs over 20 US cents to classify 1,000 tweets. Can we do better? Why yes – let's turn back to some classical ML methods instead.

### Classical ML Methods for NLP

One of the standard approaches to sentiment analysis is to turn the text into vectors somehow (we'll use one of the simplest options – TF-IDF vectorization) and feed those vectors to some classification model (we'll use an SVM and a Decision Tree). I thought that taking the methods that are usually discussed in intro ML and NLP undergrad courses, rather than something sophisticated, would make my point even clearer. Let's see what we get with those.

First of all, vectorization. TF-IDF takes 1.48 seconds to process both the training and test sets (a total of 75,000 tweets!). Training and inference times for the SVC and the Decision Tree are recorded below.

| Method | Training time | Inference time |
|--------|---------------|----------------|
| SVC    | 3.28s         | 1.44ms         |
| Tree   | 23.3s         | 2.4ms          |

Blazing fast! And that's without using any GPU – it will even run on a few-years-old laptop without any issues. Let's now have a look at whether we lose anything in terms of performance.

| Method | Accuracy | Recall (Neg/Neu/Pos) | Precision (Neg/Neu/Pos) |
|--------|----------|----------------------|-------------------------|
| SVC    | 0.85     | 0.86/0.85/0.83       | 0.84/0.87/0.82          |
| Tree   | 0.92     | 0.94/0.91/0.91       | 0.88/0.93/0.92          |

On the contrary, the performance is way better. The Decision Tree is a touch better than the SVC on all scores, but it takes more time to train (albeit still under a minute). And it cost me virtually nothing (I ran everything inside a Colab notebook).
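For reference, the whole classical pipeline fits in a few lines of scikit-learn. This is a minimal sketch under some assumptions of mine: the tweets and labels are already loaded into train_texts, train_labels, test_texts, and test_labels (my names, not the notebook's), LinearSVC is my guess for the SVM variant, and all hyperparameters are left at their defaults.

```python
# Minimal sketch of the TF-IDF + classical classifier pipeline (scikit-learn).
# Assumes train_texts/train_labels/test_texts/test_labels are already loaded.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Fit TF-IDF on the training tweets and reuse the same vocabulary for the test tweets.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train and evaluate both classifiers with default hyperparameters.
for name, model in [("SVC", LinearSVC()), ("Tree", DecisionTreeClassifier())]:
    model.fit(X_train, train_labels)
    preds = model.predict(X_test)
    print(name, accuracy_score(test_labels, preds))
    print("  recall   :", recall_score(test_labels, preds, average=None))     # per class (one-vs-rest)
    print("  precision:", precision_score(test_labels, preds, average=None))  # per class (one-vs-rest)
```

Defaults are the whole point here: as discussed below, there's likely still headroom if you bother tuning either model.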
So, what does it give us?

## Results

Performance:

| Method        | Accuracy | Recall (Neg/Neu/Pos) | Precision (Neg/Neu/Pos) |
|---------------|----------|----------------------|-------------------------|
| Zero-shot LLM | 0.51     | 0.73/0.27/0.69       | 0.56/0.64/0.50          |
| Few-shot LLM  | 0.57     | 0.71/0.55/0.47       | 0.56/0.57/0.60          |
| SVC           | 0.85     | 0.86/0.85/0.83       | 0.84/0.87/0.82          |
| Tree          | 0.92     | 0.94/0.91/0.91       | 0.88/0.93/0.92          |

Times and costs (training times for the classical methods include the 1.48s spent fitting TF-IDF):

| Method        | Training time | Inference time | Cost                     |
|---------------|---------------|----------------|--------------------------|
| Zero-shot LLM | n/a           | 5min 51s       | 44k × $0.0015 = $0.066   |
| Few-shot LLM  | n/a           | 5min 42s       | 161k × $0.0015 = $0.2415 |
| TF-IDF + SVC  | 4.76s         | 1.44ms         | Practically zero         |
| TF-IDF + Tree | 24.8s         | 2.4ms          | Practically zero         |

Well, both classical methods surpass the LLM-based ones in test performance, time taken, and cost. Huh.

## Discussion

One could say that I didn't tune the LLM methods well enough and that I could improve the outcome with prompt engineering, further tweaking, or even fine-tuning. But that's the point: I've already spent $3.75 on the OpenAI API today. Why spend even more, if I can achieve rather good results with simple sklearn-provided models (and I could make them even better by spending time tweaking the SVM or the Decision Tree, yay)?

One could say that I could use some open-source LLM from the Hugging Face Hub and host it on my own hardware instead of paying for the API. Sure, but I could run the SVM on a Raspberry Pi Zero easily. In contrast, the smallest transformer model capable of such a task will require quite a bit more compute (a GPU is recommended, and a significant amount of RAM is needed). And that computing power still costs money, you know.

One could say that I picked a task that's just not very suitable for LLMs. Exactly – that's the point. You can certainly try throwing an LLM at it, and in some cases it might work, but it will likely not be very efficient. There are other tasks that LLMs are very capable of and that other methods just don't do a good job at – so let's use LLMs for those tasks.

The main point I want the reader to take away is this: before grabbing an LLM to solve the task at hand, consider whether there's a good, simple, lightweight, and well-researched method for the thing you want to do. Maybe there indeed is.

This post is directed at nobody specifically but is inspired by the frequent misuse of technology by people on the Internet.

The code for this set of experiments is available in the Colab Notebook (provide your own dataset files and OpenAI API keys).

Also published here.

Lead image by Krzysztof Niewolny / Unsplash