GPU-poor LLaMA-2

Everyone is GPU-poor these days, and some of us are poorer than others. So my mission is to fine-tune a model with only one GPU on Google Colab and run the trained model on my laptop using llama.cpp.

Why fine-tune an existing LLM?

A lot has been said about when to do prompt engineering, when to do RAG (Retrieval Augmented Generation), and when to fine-tune an existing LLM model. I will not get into the details of those arguments and will leave you with two in-depth analyses to explore on your own:

DeepLearning.ai course by Lamini
Blog post "Why You (Probably) Don't Need to Fine-tune an LLM" by Jessica Yao

Assuming you still want to fine-tune your own LLM, let's get started.

Why be a cheapskate?

You can fine-tune OpenAI's GPT-3.5-turbo model, which has become increasingly affordable both for fine-tuning and for inference. There are a few reasons you might not want to do that: your training data is super secret, you don't want to pay OpenAI every time you use the fine-tuned model, or you need to use your model without the Internet. In those cases, we will use open-source LLMs.

Right now, Meta's LLaMA-2 is the gold standard of open-source LLMs, with good performance and permissive license terms. And we will start with the smallest, the 7B model, since it will be cheaper and faster to fine-tune. Once you have gone through the whole process, you will be well on your way to the 13B and 70B models if you like.

Training dataset: Dolly 15K by DataBricks

Training GPU: The easiest to use is Google Colab. I believe you do need a Colab Pro account, which is $10 a month for 100 compute units. In the following examples, you will consume between 20–90 compute units, which translates to $2–9. I hope we all can afford that, even as cheapskates.

Fine-Tuning LLaMA-2 With QLoRA on a Single GPU

We have all heard about the tremendous cost of training a large language model, which is not something the average Jack or Jill will undertake.
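The Dolly 15K records are instruction/context/response triples, and before training they are usually flattened into a single prompt string. Here is a minimal sketch of that preprocessing step; the field names match the public databricks-dolly-15k dataset, but the prompt template and the `format_record` helper are my own illustration, not taken from any particular tutorial:

```python
# Sketch: flatten one databricks-dolly-15k record into a training prompt.
# The field names (instruction/context/response) match the public dataset;
# the "### ..." template itself is illustrative.

def format_record(sample: dict) -> str:
    """Turn one Dolly record into a single instruction-style prompt string."""
    context = sample.get("context", "").strip()
    prompt = f"### Instruction:\n{sample['instruction']}\n\n"
    if context:  # many Dolly records have no context at all
        prompt += f"### Context:\n{context}\n\n"
    prompt += f"### Response:\n{sample['response']}"
    return prompt

if __name__ == "__main__":
    # Loading the real dataset would look like:
    #   from datasets import load_dataset
    #   ds = load_dataset("databricks/databricks-dolly-15k", split="train")
    sample = {
        "instruction": "What is QLoRA?",
        "context": "",
        "response": "Fine-tuning a 4-bit quantized LLM through low-rank adapters.",
    }
    print(format_record(sample))
```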
But what we can do is freeze the model weights of an existing LLM (e.g. 7B parameters) while fine-tuning a tiny adapter (less than 1% of the total parameters, 130M for example).

One of these adapters is called LoRA (Low-Rank Adaptation), not to be confused with the red-haired heroine of the movie "Run, Lola, Run!".

In addition, QLoRA backpropagates gradients through a frozen, 4-bit quantized pre-trained language model into the Low-Rank Adapters (LoRA), instead of using a 16-bit model. Thus we can fit the entire training run into the VRAM of a single commodity GPU.

You can read about the trade-offs between this method and the traditional full-parameter method in the blog post "Fine-Tuning LLMs: LoRA or Full-Parameter? An In-depth Analysis with Llama 2".

There are good tutorials and notebooks on fine-tuning LLaMA-2 models with LoRA, for example:

OVHcloud tutorial and notebook
Philipp Schmid tutorial and notebook

In this article, I'm using the OVHcloud guide with minor changes to the training parameters. I used Google Colab Pro's Nvidia A100 high-memory instance, and the total fine-tuning ran about 7 hours and consumed 91 compute units.

Google Colab A100 high memory:
CPU RAM: 83.5GB
GPU RAM: 40GB
13 compute units per hour

Actual memory usage during training:
CPU: 6.1GB (peak, varies)
GPU: 25.8GB (peak, varies)

You can certainly use a single T4 high-memory instance (15GB VRAM), which will take longer but cost less. I started but did not run through the entire training process; it was estimated at about 24 hours and 50 compute units. I'm quite sure someone could use Nvidia's 4090 (24GB VRAM) or an equivalent consumer GPU for this fine-tuning task as well.

Note: Philipp Schmid's script has more tricks to reduce training time: my test run on an A100 high-memory instance lasted about an hour and 15 minutes and cost less than 20 compute units.
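It is easy to see why the adapter stays so small: a rank-r LoRA adapter on a d_out × d_in weight matrix adds only r × (d_in + d_out) trainable parameters, instead of d_out × d_in. A back-of-the-envelope sketch (the 4096 hidden size and 32 layers are LLaMA-2 7B's; the rank of 16 and the choice of adapting only the four attention projections are illustrative and depend on your training config — adapting more layers or raising the rank grows the count toward the 130M figure mentioned above):

```python
# Back-of-the-envelope LoRA parameter count.
# A rank-r adapter on a (d_out x d_in) matrix adds r * (d_in + d_out) weights
# (the two low-rank factors), versus d_out * d_in for a full update.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one rank-r LoRA adapter."""
    return r * (d_in + d_out)

if __name__ == "__main__":
    d = 4096       # hidden size of LLaMA-2 7B
    layers = 32    # transformer blocks in the 7B model
    r = 16         # illustrative LoRA rank
    per_matrix = lora_params(d, d, r)
    # adapting the 4 attention projections (q, k, v, o) in every layer:
    total = per_matrix * 4 * layers
    print(f"{total:,} adapter params = {total / 7e9:.2%} of 7B")  # well under 1%
```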
Once the training is done, we save the LoRA adapter's final checkpoints to mounted Google Drive so we don't lose them once the Google Colab session is over:

output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)

You can see that the file "adapter_model.bin" is tiny (152.7MB) compared to llama2-7b's "consolidated.00.pth" (13.5GB).

Inference with llama.cpp

Both fine-tuning tutorials use GPU-based inference, but a true cheapskate would probably want to use his/her own laptop with a low-spec CPU and GPU. This is where llama.cpp comes into play. Your fine-tuned 7B model will run comfortably, at fast speed, on an M1-based MacBook Pro with 16GB of unified RAM. You can push it to run the 13B model as well if you free up some memory from resource-hungry apps.

There are a few simple steps to get your recently fine-tuned model ready for llama.cpp. All the models reside in the directory "models". Let's create a new directory called "lora" under "models", copy over all the original llama2-7B files, and then copy over the two adapter files from the previous step.

Step 1: Convert the LoRA adapter model to a ggml-compatible format:

python3 convert-lora-to-ggml.py models/lora

Step 2: Convert into f16/f32 models:

python3 convert.py models/lora

Step 3: Quantize to 4 bits:

./quantize ./models/lora/ggml-model-f16.gguf ./models/lora/ggml-model-q4_0.gguf q4_0

Now, finally, you have your shiny new gguf file baked with your special training data. It's time to use it, or in fancier words, "inference with it"!

./main -m models/lora/ggml-model-q4_0.gguf --color -ins -n -1

You can see llama-2-7b-LoRA running blazing fast, while I have dozens of tabs open in two Chrome browsers, a Docker engine running a database and a web server, Visual Studio Code, and all the instant messaging systems imaginable, all on an average MacBook Pro M1 with 16GB of memory.

Next steps
Congratulations! You have just fine-tuned your first personal LLM and run it on your laptop. Now there are a few things you can do next:

Define a use case where fine-tuning an existing LLM will give you a unique advantage.

Prepare your own dataset: for fine-tuning purposes, it will typically look like databricks-dolly-15k, i.e. question–answer pairs. Your private data most likely won't look like that, so you can use your own scripts, manual labor, and/or GPT-4 to format your data into the right training set.

Define your evaluation metrics, and compare the different approaches (prompt engineering, RAG, GPT-3.5 fine-tuning, open-source LLM fine-tuning). Start with a small amount of training data, and build on your experience and success.

Show the world (and your boss) what you have just built!

Feature image created with QLoRA and llama.cpp