The Cheapskate’s Guide to Fine-Tuning LLaMA-2 and Running It on Your Laptop

Everyone is GPU-poor these days, and some of us are poorer than others. So my mission is to fine-tune a LLaMA-2 model with only one GPU on Google Colab and run the trained model on my laptop using llama.cpp.

Why fine-tune an existing LLM?

A lot has been said about when to do prompt engineering, when to do RAG (Retrieval-Augmented Generation), and when to fine-tune an existing LLM. I will not get into the details of those arguments; instead, I will leave you with two in-depth analyses to explore on your own.

DeepLearning.ai course by Lamini

Blog post “Why You (Probably) Don’t Need to Fine-tune an LLM” by Jessica Yao

Assuming you still want to fine-tune your own LLM, let’s get started.

Why be a cheapskate?

You can fine-tune OpenAI’s GPT-3.5-turbo model, which has become increasingly affordable for both fine-tuning and inference. But there are a few reasons you might not want to do that: your training data is super secret, you don’t want to pay OpenAI every time you use the fine-tuned model, or you need to use your model without an Internet connection. In those cases, open-source LLMs are the way to go.

Right now, Meta’s LLaMA-2 is the gold standard of open-source LLMs, with good performance and permissive license terms. We will start with the smallest model, 7B, since it is the cheapest and fastest to fine-tune. Once you have gone through the whole process, you will be well on your way to the 13B and 70B models if you like.

Training dataset: Dolly 15K by Databricks

Training GPU: The easiest option is Google Colab. I believe you need a Colab Pro account, which costs $10 a month for 100 compute units. The following examples consume between 20 and 90 compute units, which translates to $2–9. I hope that is affordable, even for cheapskates.

Fine-Tuning LLaMA-2 With QLoRA on a Single GPU

We have all heard about the tremendous cost of training a large language model, which is not something the average Jack or Jill will undertake. What we can do instead is freeze the weights of an existing LLM (e.g. 7B parameters) and fine-tune only a tiny adapter (a percent or two of the total parameters at most, 130M for example).

One such adapter technique is LoRA (Low-Rank Adaptation), not to be confused with the red-haired heroine of the movie “Run Lola Run”.
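To make the “tiny adapter” idea concrete, here is a rough sketch of the arithmetic. LoRA leaves a weight matrix of shape d_out x d_in frozen and trains two small matrices of shapes d_out x r and r x d_in, so it adds r * (d_in + d_out) trainable parameters per adapted matrix. The hidden size (4096) and layer count (32) are LLaMA-2 7B’s published dimensions; the rank, the choice of adapted projections, and the rounded base parameter count are illustrative assumptions, not the settings used in this article.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # B is (d_out x rank) and A is (rank x d_in); the frozen weight is untouched.
    return rank * (d_in + d_out)


HIDDEN = 4096                # LLaMA-2 7B hidden size
LAYERS = 32                  # LLaMA-2 7B transformer layers
RANK = 16                    # assumption: illustrative LoRA rank
ADAPTED_PER_LAYER = 4        # assumption: q, k, v, o projections, each 4096 x 4096
BASE_PARAMS = 7_000_000_000  # rounded size of the frozen base model

adapter_total = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
fraction = adapter_total / BASE_PARAMS

print(f"{adapter_total:,} trainable parameters ({fraction:.2%} of the base model)")
```

With these assumed settings the adapter has about 16.8M parameters, roughly a quarter of a percent of the base model. A higher rank or adapting more matrices (for example the feed-forward projections) raises that count accordingly.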

In addition, QLoRA keeps the pre-trained language model frozen and quantized to 4 bits instead of 16 bits, and backpropagates gradients through it into the Low-Rank Adapters (LoRA). Thus we can fit the entire training job into the memory of a single commodity GPU.
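A back-of-the-envelope sketch of why the 4-bit base model matters. It only counts the frozen weights and uses a rounded 7B parameter count (an assumption); activations, adapter gradients, optimizer state, and quantization metadata all come on top, so treat the numbers as a lower bound rather than a prediction of actual usage.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: rounded parameter count of the 7B model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

The 16-bit weights alone come to about 14 GB, which leaves almost no headroom on a 15GB T4, while the 4-bit weights need about 3.5 GB.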

You can read about the trade-offs between this method and traditional full-parameter fine-tuning here:

Blog post “Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2”

There are good tutorials and notebooks on fine-tuning LLaMA-2 models with LoRA, for example:

OVHcloud tutorial and notebook
Philipp Schmid tutorial and notebook

In this article, I’m using the OVHcloud guide with minor changes to the training parameters.

I used Google Colab Pro’s Nvidia A100 high-memory instance. The whole fine-tuning run took about 7 hours and consumed 91 compute units.

Google Colab A100 high memory

Nvidia A100 high memory
CPU RAM: 83.5GB
GPU RAM: 40GB
13 compute units per hour

Actual memory usage during training:

CPU RAM: up to 6.1GB (varies)
GPU RAM: up to 25.8GB (varies)

You can certainly use a single T4 high-memory instance (15GB GPU RAM), which will take longer but cost less. I started a run on one but did not let it finish; the estimate was about 24 hours and 50 compute units. I’m quite sure someone can use Nvidia’s 4090 (24GB GPU RAM) or an equivalent consumer GPU for this fine-tuning task as well.

Note: Philipp Schmid’s script has more tricks to reduce training time: my test run on an A100 high-memory instance lasted about an hour and 15 minutes and cost less than 20 compute units.
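For budgeting, the numbers above reduce to simple arithmetic. The sketch below uses only the figures quoted in this article ($10 per 100 compute units, 13 units per hour on the A100); the function name is made up for illustration, and real Colab rates may change.

```python
PRICE_PER_UNIT_USD = 10 / 100  # Colab Pro: $10 for 100 compute units (as quoted above)
A100_UNITS_PER_HOUR = 13       # A100 high-memory rate quoted above


def run_cost(hours: float, units_per_hour: float) -> tuple[float, float]:
    """Return (compute units consumed, cost in USD) for a training run."""
    units = hours * units_per_hour
    return units, units * PRICE_PER_UNIT_USD


full_units, full_usd = run_cost(7, A100_UNITS_PER_HOUR)       # the 7-hour run
short_units, short_usd = run_cost(1.25, A100_UNITS_PER_HOUR)  # the 1h15m run

print(f"7 h on A100:    {full_units:.0f} units, ${full_usd:.2f}")
print(f"1.25 h on A100: {short_units:.2f} units, ${short_usd:.2f}")
```

The 7-hour run works out to 91 units (about $9), and the shorter run to roughly 16 units, consistent with the “less than 20” observed.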

Once the training is done, we save the LoRA adapter’s final checkpoints to mounted Google Drive so we don’t lose them once the Google Colab session is over:

output_dir = "results/llama2/final_checkpoint"

train(model, tokenizer, dataset, output_dir)
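Note that the path above is relative to the Colab working directory. One way to get the checkpoint onto Drive is to copy the folder once training finishes. This is a minimal sketch, not the tutorial’s own code: the helper name and the Drive folder `llama2-lora` are hypothetical, and the commented lines assume Drive is mounted at the usual `/content/drive` location.

```python
import shutil
from pathlib import Path


def backup_checkpoint(checkpoint_dir: str, backup_root: str) -> Path:
    """Copy a checkpoint folder into backup_root and return the new location."""
    src = Path(checkpoint_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"checkpoint folder not found: {src}")
    dst = Path(backup_root) / src.name
    # dirs_exist_ok lets a re-run overwrite an earlier backup (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# In a Colab notebook (hypothetical target folder):
# from google.colab import drive
# drive.mount("/content/drive")
# backup_checkpoint("results/llama2/final_checkpoint",
#                   "/content/drive/MyDrive/llama2-lora")
```

Alternatively, you can point `output_dir` directly at a folder under the mounted Drive so that nothing needs to be copied afterwards.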

You can see that the file “adapter_model.bin” is tiny (152.7MB) compared to llama2-7b’s “consolidated.00.pth” (13.5GB).

Inference with llama.cpp

Both fine-tuning tutorials use GPU-based inference, but a true cheapskate would probably want to use their own laptop with a low-spec CPU and GPU. This is where llama.cpp comes into play. Your fine-tuned 7B model will run comfortably and quickly on an M1-based MacBook Pro with 16GB of unified memory. You can even push it to run the 13B model if you free up some memory from resource-hungry apps.

There are a few simple steps to get your freshly fine-tuned model ready for llama.cpp. All the models reside in the directory “models”. Let’s create a new directory called “lora” under “models”, copy over all the original llama2-7B files, and then copy over the two adapter files from the previous step. The folder “lora” should then contain the original model files plus the two adapter files.

Step 1: Convert the LoRA adapter model to a ggml-compatible format:

python3 convert-lora-to-ggml.py models/lora

Step 2: Convert into f16/f32 models:

python3 convert.py models/lora

Step 3: Quantize to 4 bits:

./quantize ./models/lora/ggml-model-f16.gguf ./models/lora/ggml-model-q4_0.gguf q4_0
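If you expect to repeat these three steps for every new adapter, they can be chained from a small script. The sketch below simply rebuilds the same three commands shown above and runs them in order; the function names are made up for illustration, and it assumes you run it from the llama.cpp checkout where `convert-lora-to-ggml.py`, `convert.py`, and `quantize` live.

```python
import subprocess
from pathlib import Path


def build_commands(model_dir: str, quant_type: str = "q4_0") -> list[list[str]]:
    """Return the three conversion commands from the steps above, in order."""
    d = Path(model_dir)
    f16 = d / "ggml-model-f16.gguf"
    quantized = d / f"ggml-model-{quant_type}.gguf"
    return [
        ["python3", "convert-lora-to-ggml.py", str(d)],        # Step 1
        ["python3", "convert.py", str(d)],                     # Step 2
        ["./quantize", str(f16), str(quantized), quant_type],  # Step 3
    ]


def run_pipeline(model_dir: str, llama_cpp_dir: str = ".") -> None:
    """Run each step, stopping at the first command that fails."""
    for cmd in build_commands(model_dir):
        subprocess.run(cmd, cwd=llama_cpp_dir, check=True)


# Print the commands without running them, as a dry run.
for cmd in build_commands("models/lora"):
    print(" ".join(cmd))
```

Calling `run_pipeline("models/lora")` executes the steps; `check=True` makes a failed conversion raise immediately instead of quantizing a stale file.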

Finally, you have a shiny new GGUF file that is baked with your special training data. It’s time to use it or, in fancy words, to “run inference with it”!

./main -m models/lora/ggml-model-q4_0.gguf --color -ins -n -1

You can see llama-2-7b-LoRA running blazing fast on an average MacBook Pro M1 with 16GB of memory, even though I have dozens of tabs open in two Chrome browsers, a Docker engine running a database and a web server, Visual Studio Code, and every instant messaging app imaginable.

Next steps

Congratulations! You have just fine-tuned your first personal LLM and run it on your laptop. Now there are a few things you can do next:

Define a use case where fine-tuning an existing LLM will give you a unique advantage
Prepare your own dataset for fine-tuning: typically it will consist of question-answer pairs, like databricks-dolly-15k. Your private data most likely won’t look like that, so you can use your own scripts, manual labor, and/or GPT-4 to format it into a suitable training set.
Define your evaluation metrics, and compare the different approaches (prompt engineering, RAG, GPT-3.5 fine-tuning, open-source LLM fine-tuning). Start with a small amount of training data, and build on your experience and success.
Show the world (and your boss) what you have just built!
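For the dataset-preparation step, here is a minimal sketch of turning one record into a training prompt. The field names `instruction`, `context`, and `response` are the ones used by databricks-dolly-15k; the `### ...` prompt template and the sample record are assumptions for illustration, so adapt them to whatever format your training script expects.

```python
def format_example(record: dict) -> str:
    """Turn one Dolly-style record into a single training prompt string."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    response = record["response"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:  # many records have no context, so only add the section if present
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


# Hypothetical record in the same shape as a Dolly row.
sample = {
    "instruction": "Summarize the refund policy.",
    "context": "Refunds are accepted within 30 days of purchase.",
    "response": "Customers can get a refund within 30 days.",
}
print(format_example(sample))
```

Your own data only needs to be mapped into these three fields once; the same function then works for every record.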

Feature image created with QLoRA and llama.cpp