Welcome to this "to-the-point" tutorial on how to quantize any Large Language Model (LLM) available on Hugging Face using llama.cpp. Whether you're a data scientist, a machine learning engineer, or simply an AI enthusiast, this guide is designed to demystify model quantization and make it easy to follow.
By the end of this tutorial, you'll have a clear understanding of how to compress LLMs efficiently without a significant loss in performance, enabling their deployment in resource-constrained environments. You can also run these models on your favorite local setup using Ollama!
GitHub repo is here: https://github.com/mickymultani/QuantizeLLMs
Before we dive into the technicalities, let's briefly discuss what model quantization is and why it's important. Model quantization is a technique used to reduce the precision of the numbers used in a model's weights and activations.
This process significantly reduces the model size and speeds up inference times, making it possible to deploy state-of-the-art models on devices with limited memory and computational power, such as mobile phones and embedded systems.
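To make the idea concrete, here is a tiny illustrative sketch of symmetric linear quantization, mapping float weights to 4-bit integers with a single scale factor. This is only to show the principle, not the exact scheme llama.cpp uses:
import numpy as np
# Toy example: quantize a handful of weights to 4-bit signed integers
weights = np.array([0.42, -1.37, 0.08, 2.15, -0.91], dtype=np.float32)
scale = np.abs(weights).max() / 7  # 4-bit signed range is roughly [-8, 7]
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale
print(quantized)    # small integers, each representable in 4 bits
print(dequantized)  # close to the originals, with some rounding error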
llama.cpp is a powerful tool that facilitates the quantization of LLMs. It supports various quantization methods, making it highly versatile for different use cases. The tool is designed to work seamlessly with models from the Hugging Face Hub, which hosts a wide range of pre-trained models across various languages and domains.
This tutorial is designed with flexibility in mind, catering to both cloud-based environments and local setups. I personally conducted these steps on Google Colab, utilizing the NVIDIA Tesla T4 GPU, which provides a robust platform for model quantization and testing.
However, you should not feel limited to this setup. The beauty of llama.cpp and the techniques covered in this guide is their adaptability to various environments, including local machines with GPU support.
For instance, if you're using a MacBook with Apple Silicon, you can follow along and leverage the GPU support for model quantization, thanks to the cross-platform compatibility of the tools and libraries we are using.
Google Colab provides a convenient, cloud-based environment with access to powerful GPUs like the T4. If you choose Colab for this tutorial, make sure to select a GPU runtime by going to Runtime > Change runtime type > T4 GPU. This ensures that your notebook has access to the necessary computational resources.
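Once the runtime is set, a quick sanity check (assuming an NVIDIA GPU runtime) confirms that the GPU is visible to your notebook:
!nvidia-smi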
For those opting to run the quantization process on a MacBook with Apple Silicon, ensure that you have the necessary development tools (such as the Xcode Command Line Tools) and libraries installed. While the setup differs slightly from the Linux-based Colab environment, Python's ecosystem and llama.cpp's build process (using Metal on Apple Silicon rather than CUDA) are well-documented, ensuring a smooth setup.
Regardless of your platform, access to models on the Hugging Face Hub requires authentication. To integrate this seamlessly into your workflow, especially in collaborative or cloud-based environments like Google Colab, it's advisable to set up your Hugging Face authentication token securely.
On Google Colab, you can safely store your Hugging Face token using Colab's "Secrets" feature: click the "Key" icon in the sidebar, select "Secrets", and add a new secret named HF_TOKEN with your Hugging Face token as the value. This keeps your token out of your notebook's code.
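Inside the notebook you can then read the secret and log in to the Hub. A minimal sketch, assuming the secret is named HF_TOKEN and notebook access to it is enabled:
from google.colab import userdata
from huggingface_hub import login
hf_token = userdata.get("HF_TOKEN")  # retrieve the secret stored in Colab
login(token=hf_token)  # authenticate huggingface_hub for downloads and uploads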
For local setups, consider setting the HF_TOKEN environment variable in your shell or logging in with the Hugging Face CLI, so that your scripts have the necessary permissions to download and upload models to the Hub without hard-coding your credentials.
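For example, either of the following works on a local machine (the token shown is a placeholder; the huggingface-cli tool ships with the huggingface_hub package):
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx   # picked up automatically by huggingface_hub
huggingface-cli login                 # or log in interactively; prompts for your token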
Now, let's walk through the process of quantizing a model using llama.cpp. For this tutorial, we'll quantize the "google/gemma-2b-it" model from Hugging Face, but the steps can be applied to any model of your choice.
First, we need to clone the llama.cpp repository and install the necessary requirements:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements.txt
These commands clone the llama.cpp repository, compile the binaries with cuBLAS (CUDA) support for GPU acceleration, and install the Python dependencies required for the conversion scripts.
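If you're building locally on Apple Silicon instead, there is no CUDA; depending on your llama.cpp version, Metal support is either enabled by default on macOS or switched on with a build flag (check the repository's README for the exact option). A plain build would look roughly like this:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make && pip install -r requirements.txt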
Next, we download the model from the Hugging Face Hub using the snapshot_download function, which gives us a local copy of the model to quantize:
from huggingface_hub import snapshot_download
model_name = "google/gemma-2b-it"  # any Hugging Face model ID you have access to
base_model = "./original_model/"   # local directory for the downloaded files
# Download real copies of the files (not symlinks into the HF cache)
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)
Before quantizing, we convert the downloaded model to the format llama.cpp works with (GGUF). This step involves specifying the desired precision (f16, i.e. half-precision floating point) and the output file:
!mkdir ./quantized_model/
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf
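At this point it's worth checking that the conversion produced a GGUF file of roughly the expected size (for a 2B-parameter model in f16, on the order of a few gigabytes):
!ls -lh ./quantized_model/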
With the model in the correct format, we proceed to the quantization step. Here, we're using a quantization method called q4_k_m, specified in the methods list. This is one of llama.cpp's "K-quant" types: weights are stored at roughly 4 bits each, and the _M ("medium") variant keeps some of the more sensitive tensors at slightly higher precision for a good balance between size and quality:
import os
methods = ["q4_k_m"]
quantized_path = "./quantized_model"
for m in methods:
    qtype = f"{quantized_path}/{m.upper()}.gguf"
    # Arguments: <input f16 gguf> <output gguf> <quantization method>
    os.system(f"./llama.cpp/quantize {quantized_path}/FP16.gguf {qtype} {m}")
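Because the loop iterates over the methods list, you can produce several quantization levels in one pass by listing more of llama.cpp's supported types, for example:
methods = ["q4_k_m", "q5_k_m", "q8_0"]  # each entry produces its own .gguf file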
After quantization, it's important to test the model to ensure it performs as expected. The following command runs the quantized model in interactive chat mode, seeded with a sample prompt file, so you can assess its output quality:
! ./llama.cpp/main -m ./quantized_model/Q4_K_M.gguf -n 90 --repeat_penalty 1.0 --color -i -r "User: " -f llama.cpp/prompts/chat-with-bob.txt
Finally, if you wish to share your quantized model with the community, you can upload it to the Hugging Face Hub using HfApi. This step involves creating a new repository for your quantized model and uploading the .gguf file:
from huggingface_hub import HfApi, create_repo
model_path = "./quantized_model/Q4_K_M.gguf"
repo_name = "gemma-2b-it-GGUF-quantized"
# Create a public repository on the Hub for the quantized model
repo_url = create_repo(repo_name, private=False)
api = HfApi()
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf",
    repo_id="yourusername/gemma-2b-it-GGUF-quantized",
    repo_type="model",
)
Make sure to replace "yourusername" with your actual Hugging Face username!
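And since the introduction mentioned Ollama: once you have the .gguf file (downloaded from your Hub repo or taken straight from ./quantized_model/), you can run it locally. A minimal sketch, assuming Ollama is installed and the path is correct, is to create a file named Modelfile containing a single line:
FROM ./quantized_model/Q4_K_M.gguf
Then build and chat with the model:
ollama create gemma-2b-it-q4 -f Modelfile
ollama run gemma-2b-it-q4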
Congratulations! You've just learned how to quantize a large language model using llama.cpp. This process not only helps in deploying models to resource-constrained environments but also in reducing computational costs for inference.
By sharing your quantized models, you contribute to a growing ecosystem of efficient AI models accessible to a broader audience. Quantization is a powerful tool in the machine learning practitioner's toolkit.
Happy quantizing!