Fine-tuning is both an art and a methodical process, much like software engineering. Recent blogs and vlogs have covered only the coding aspect of it, overlooking the different stages involved in the process. So, in this bird’s-eye view, I have laid out what it takes to go from a pre-trained Large Language Model (LLM) to a fine-tuned LLM ready for production release. I will be diving deeper into each of these steps, hands-on, in my upcoming articles.
Whenever we are given a pre-trained model such as LLAMA 2, in most cases a bit of prompt engineering is sufficient. Prompt engineering is all about cleverly constructing the query to the LLM. Good prompts get far better responses from the LLM than ordinary prompts. For example, if I ask the model, “What are the qualities of a good model?”, it assumes I am asking about a fashion model and lists physical attributes, confidence, professionalism, adaptability, and so on. But if I provide some context by prompting, “What are the qualities of a good LLM model?”, it now responds with the answer I was expecting: large vocabulary, contextual understanding, fluency, consistency. This is a naive example of prompt engineering.
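To make the idea concrete, here is a minimal sketch of sending those two prompts to the same model through the Hugging Face transformers pipeline. The checkpoint name is only illustrative (the Llama 2 chat weights are gated), and any instruction-tuned model you have access to would do:

```python
# Minimal prompt-engineering sketch: same model, two prompts with different
# amounts of context. The checkpoint name is illustrative and requires access.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

vague_prompt = "What are the qualities of a good model?"
specific_prompt = "What are the qualities of a good LLM model?"

for prompt in (vague_prompt, specific_prompt):
    result = generator(prompt, max_new_tokens=100, do_sample=False)
    print(prompt, "->", result[0]["generated_text"])
```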
In another example, if we ask LLAMA 2 a domain-specific question, “How long is the current inflation likely to last?”, it gives a generic, factual response covering different inflation scenarios, such as short-term inflation, moderate inflation, higher inflation, and hyperinflation, without actually answering the question.
Scenarios like these indicate that the model lacks expert-level skill in a particular domain. The solution to this problem lies in fine-tuning.
If you would like a visual explanation of this article, it is available as a video below:
Fine-tuning is when you modify the pre-trained LLM by updating its parameters. Fine-tuning is useful if you want to make the LLM an expert in a specific domain, such as medicine, or if you want to make it more specialized at a particular task of your choosing, such as conversational AI or reasoning. As a result, you get a proprietary, in-house model for the specific task or problem that your organization is uniquely solving, trained on the data at its disposal.
Fine-tuning generally improves model performance. The InstructGPT paper clearly showed (in the figure below) that GPT-3 with supervised fine-tuning outperformed prompt engineering alone. These results are based on scores given to the model outputs by human labelers on a scale of 1 to 7.
Below is an overview of the six steps organizations have to follow to take a pre-trained model, fine-tune it and deploy it in production.
The first step in fine-tuning is selecting a pre-trained model. Here we have a wide array of models to choose from, such as LLAMA 2, Vicuna, and Alpaca. Even though they are all classed as open-source LLMs, different models do well at different tasks.
For example, the RoBERTa model developed by Meta is well suited to text classification tasks. So, if you want to do sentiment analysis, you are better off choosing it rather than Longformer, which is a specialist model for long context lengths.
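As a rough sketch of what this choice looks like in code, here is how the two checkpoints mentioned above can be loaded with the Hugging Face transformers library (the model names are the publicly available base checkpoints, and num_labels=2 assumes a binary sentiment task):

```python
# Step 1 sketch: pick a pre-trained checkpoint that matches the task.
from transformers import AutoModelForSequenceClassification

# Short-text sentiment analysis: an encoder like RoBERTa is a natural fit.
roberta = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Very long documents: Longformer supports a 4,096-token context window.
longformer = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)
```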
The second step towards fine-tuning is defining the task based on the problem at hand. For example, if you are a law firm whose work involves reading pages and pages of documents, you may want to fine-tune for summarisation of those documents. So your task is summarisation!
Before moving on to the next step, you will also want to ensure that there is sufficient data to feed the LLM beast. To avoid overfitting, you may need hundreds of thousands of data points for the given task.
The third and most labor-intensive step of fine-tuning is collating a dataset. At this stage, all the data you gathered in the previous stage goes through labeling. Labeling can be very quick or very time-consuming, depending on the task you have chosen.
For example, if your task is sentiment classification, the labelers can get away with simply clicking a “positive” or “negative” button. If your task is summarisation, it can be quite time-intensive, with labelers even writing summaries of the documents themselves. Though this step is labor-intensive, you end up with an in-house dataset that is unique to you and your organization. Showing customers these datasets, and the models trained on them, can be a huge selling point.
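To give a feel for what the labeled data might look like on disk, here is a small sketch using a JSONL layout; the field names and file name are illustrative conventions rather than a required format:

```python
# Sketch of an in-house labeled dataset written as JSONL (illustrative layout).
import json

labeled_examples = [
    # Sentiment classification: labelers only click "positive" or "negative".
    {"task": "sentiment", "text": "The onboarding was effortless.", "label": "positive"},
    {"task": "sentiment", "text": "Support never replied to my ticket.", "label": "negative"},
    # Summarisation: labelers write a reference summary, which takes much longer.
    {
        "task": "summarisation",
        "document": "Full text of a lengthy legal document goes here.",
        "summary": "A short human-written summary goes here.",
    },
]

with open("inhouse_dataset.jsonl", "w") as f:
    for example in labeled_examples:
        f.write(json.dumps(example) + "\n")
```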
The fourth and most exciting step is the actual supervised fine-tuning of the pre-trained LLM. All the labeled data from the previous stage now goes to the ML engineers or developers, who finally write the code to fine-tune the pre-trained LLM.
Though this step may involve only a few lines of code, it can be quite challenging to tweak the many training parameters to get training to converge and produce a successful fine-tuned model. So this step may involve extensive experimentation and can take quite some time.
As a simple example, you may have to train one model with a learning rate of 0.01 and another with a learning rate of 0.05. There are also many optimizers and data augmentation choices. Juggling all these parameters is an art in itself and is often the job of a machine learning engineer.
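Here is a minimal sketch of such an experiment using the Hugging Face Trainer on a toy sentiment dataset. The texts, labels, and the two learning rates (mirroring the example above) are purely illustrative; in practice, LLM fine-tuning usually uses far more data and much smaller learning rates, on the order of 1e-5 to 5e-5:

```python
# Step 4 sketch: supervised fine-tuning with two different learning rates.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Toy labeled data standing in for the in-house dataset from step three.
raw = Dataset.from_dict({
    "text": ["great product", "terrible support", "works as expected", "never again"],
    "label": [1, 0, 1, 0],
})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)

# Train one model per learning rate and compare the runs afterwards.
for lr in (0.01, 0.05):
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
    args = TrainingArguments(
        output_dir=f"runs/lr_{lr}",
        learning_rate=lr,
        num_train_epochs=1,
        per_device_train_batch_size=2,
    )
    Trainer(model=model, args=args, train_dataset=tokenized).train()
```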
In any case, if you are expecting a good fine-tuned model from the very first training run, you will be disappointed; it is more involved than that. This leads to the question, “How do you even know if your fine-tuned model is any good?”.
And so the next step in fine-tuning is the evaluation of the fine-tuned model. First, you need to choose which metric you care about, and to a large extent this depends on what the customer exactly wants.
For example, if you are doing sentiment classification, the metric most people worry about is classification accuracy. But if you are building conversational bots, the metric for your evaluation would probably be the F1 score. Remember, though, that the customer is god. So the ultimate evaluation for your chat model would be business outcomes such as a decrease in escalations to first-level calls, conversion rate, email reduction rate, and so on.
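As a quick sketch of the technical side of this step, here is how accuracy and the F1 score could be computed with scikit-learn, assuming y_true and y_pred hold the held-out labels and model predictions (the values below are illustrative):

```python
# Step 5 sketch: computing accuracy and F1 on a held-out test set.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]  # illustrative model predictions

print("accuracy:", accuracy_score(y_true, y_pred))        # 0.833...
print("f1:", f1_score(y_true, y_pred, average="binary"))  # 0.857...
```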
With metrics, there is always the challenge of mapping a technical evaluation score such as the F1 score to business scores such as conversion rate and escalation rate. How to tackle this challenge opens up a whole new chapter of debate and is not the topic of today’s discussion.
The result of the evaluation can take you in one of three directions:
And that is where the fine-tuning process ends.
In my opinion, the above six steps form the typical lifecycle of a model that goes all the way from pre-trained to deployed state in production.
In the upcoming articles, let's get hands-on and see how we can fine-tune an open-source model using a Colab notebook. So please stay tuned, and I will see you in the next one. Until then, take care…