I've always been captivated by the idea of running large language models directly on user devices. There's something magical about running Llama 3.1 8B, one of the most advanced language models, on your computer or smartphone.
In this post, I'll introduce you to a project that runs a 2-bit compressed Llama 3.1 8B directly in your browser.
You can try it out yourself on the demo page.
Running language models on user devices isn't new: models like Llama 3.2 1B and 3B were explicitly designed for low-power hardware. The 8B Llama model, however, presents an ideal opportunity to showcase what advanced compression algorithms can do in a browser environment.
To put this in perspective, consider the model's memory requirements: each parameter takes 16 bits in its uncompressed form, so the 8B model weighs in at roughly 16 GB. Standard 4-bit quantization methods like NF4 can reduce this to about 4 GB.
Our extreme compression approach takes this further, using just 2 bits per parameter and compressing the model body by a factor of 8. The head layers and embeddings still use 4-bit and 8-bit compression, bringing the total compressed model size to around 2.5 GB.
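To make the arithmetic concrete, here's a rough back-of-the-envelope sketch. The 7B/1B split between the model body and the embeddings/head, and the averaged ~6 bits for the latter, are my own approximations rather than exact layer counts.

```rust
// Rough size estimates for an 8B-parameter model at different bit widths.
// The body/head split and the averaged head bit width are assumptions.
fn size_gb(params: f64, bits_per_param: f64) -> f64 {
    params * bits_per_param / 8.0 / 1e9
}

fn main() {
    let body_params = 7.0e9; // transformer body (assumed)
    let head_params = 1.0e9; // embeddings + output head (assumed)

    println!("fp16: {:.1} GB", size_gb(body_params + head_params, 16.0));
    println!("nf4:  {:.1} GB", size_gb(body_params + head_params, 4.0));
    // 2-bit body plus 4/8-bit head and embeddings (averaged to ~6 bits here):
    println!("ours: {:.1} GB", size_gb(body_params, 2.0) + size_gb(head_params, 6.0));
}
```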
This extreme compression doesn't just save space; it also improves performance.
Token generation is limited mainly by memory bandwidth rather than raw arithmetic, so reducing the number of bytes read per token translates directly into faster execution. Remarkably, our 2-bit compressed Llama 3.1 8B outperforms the uncompressed Llama 3.2 3B while occupying only half the space.
At their core, large language models are collections of matrices. The primary computational workload is matrix-vector multiplication, which is where compression methods focus their optimization efforts. These methods aim to create more compact matrix representations while minimizing quality loss.
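For reference, the uncompressed version of that core operation is just a dense matrix-vector product. The sketch below is purely illustrative (row-major layout, no quantization), but it shows where virtually all the memory traffic goes.

```rust
/// y = W * x for a row-major (out_dim x in_dim) matrix.
/// Token generation spends almost all of its time streaming W from memory,
/// which is why shrinking W speeds things up.
fn matvec(w: &[f32], x: &[f32], out_dim: usize, in_dim: usize) -> Vec<f32> {
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(x.len(), in_dim);
    (0..out_dim)
        .map(|row| {
            w[row * in_dim..(row + 1) * in_dim]
                .iter()
                .zip(x)
                .map(|(wi, xi)| wi * xi)
                .sum::<f32>()
        })
        .collect()
}
```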
In May 2024, the Yandex Research team, in collaboration with the Institute of Science and Technology Austria (ISTA) and King Abdullah University of Science and Technology (KAUST), published a new method for extreme, 2-bit compression of large language models.
Our project uses this compression scheme, which splits each weight matrix into groups of eight elements.
Each group is represented as the sum of two vectors drawn from 256-entry dictionaries. This clever approach requires only 16 bits (two 8-bit indices) per group of eight matrix elements, achieving our target of 2 bits per parameter.
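Here is a minimal sketch of what decoding one such group could look like. The struct and layout are simplifications I'm assuming for illustration (real formats also store scales and pack the indices tightly), and the actual kernel decodes groups on the fly inside the matrix-vector product.

```rust
const GROUP: usize = 8;

/// Two dictionaries, each holding 256 vectors of eight elements.
/// (Simplified: scales and tight index packing are omitted.)
struct TwoCodebooks {
    cb1: Box<[[f32; GROUP]; 256]>,
    cb2: Box<[[f32; GROUP]; 256]>,
}

impl TwoCodebooks {
    /// Reconstruct eight weights from two 8-bit indices: 16 bits for
    /// eight parameters, i.e. 2 bits per parameter on average.
    fn decode(&self, i1: u8, i2: u8) -> [f32; GROUP] {
        let mut out = [0.0f32; GROUP];
        for k in 0..GROUP {
            out[k] = self.cb1[i1 as usize][k] + self.cb2[i2 as usize][k];
        }
        out
    }
}
```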
WebAssembly has revolutionized browser-based programming, enabling development in virtually any language.
After studying Rust, I was delighted to discover that many fundamental LLM infrastructure libraries are written in it. Take Hugging Face's safetensors format: it's written in Rust! Whenever someone loads a safetensors model from Hugging Face in Python, they're running Rust under the hood. Similarly, OpenAI's tiktoken tokenizer, which the new Llama models use, is also Rust-based.
To optimize performance, I implemented multithreading using web workers, enabling bidirectional thread communication through message passing. The solution uses a model-parallel approach: matrices are divided by output dimension, with each worker handling its designated portion.
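Here is an illustrative sketch of that split; the helper names and the even partitioning scheme are my assumptions, not the project's actual code. Each worker owns a contiguous range of output rows and returns only its slice of the result, which the main thread then concatenates.

```rust
use std::ops::Range;

/// Which output rows worker `worker_id` out of `n_workers` is responsible for.
fn worker_rows(out_dim: usize, n_workers: usize, worker_id: usize) -> Range<usize> {
    let chunk = (out_dim + n_workers - 1) / n_workers; // ceiling division
    let start = (worker_id * chunk).min(out_dim);
    let end = ((worker_id + 1) * chunk).min(out_dim);
    start..end
}

/// Worker-side compute: multiply only the locally stored rows by x.
fn partial_matvec(local_rows: &[f32], x: &[f32], in_dim: usize) -> Vec<f32> {
    local_rows
        .chunks(in_dim)
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
        .collect()
}
```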
The most challenging aspect was orchestrating the interaction between the workers and the main thread. To handle it, I developed a custom RPC stack for the workers on top of Rust-JavaScript interoperability.
When the main thread needs to multiply a vector by a matrix, it creates a request for each worker, serializes it, and hands it to the JavaScript runtime. JavaScript forwards the request to its worker, where it is deserialized and processed, and the resulting partial output is serialized. Finally, the result travels back through JavaScript to the main thread, where it is deserialized and assembled with the other workers' portions. This carefully orchestrated process improved performance by about 2x.
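To give a flavor of that round trip on the Rust side, here is a hedged sketch. The message types and the choice of serde with bincode for serialization are assumptions made for illustration; the project's actual RPC stack may look different.

```rust
use serde::{Deserialize, Serialize};

/// Request sent from the main thread to a worker (illustrative).
#[derive(Serialize, Deserialize)]
enum Request {
    MatVec { layer: usize, input: Vec<f32> },
}

/// A worker's reply: its slice of the output vector (illustrative).
#[derive(Serialize, Deserialize)]
enum Response {
    MatVec { layer: usize, partial_output: Vec<f32> },
}

/// Main-thread side: turn a request into bytes that JavaScript can pass
/// to a worker via postMessage.
fn encode_request(req: &Request) -> Vec<u8> {
    bincode::serialize(req).expect("failed to serialize request")
}

/// Worker side: decode the request, do the work (stubbed here), and
/// serialize the reply for the trip back through JavaScript.
fn handle_request(bytes: &[u8]) -> Vec<u8> {
    let req: Request = bincode::deserialize(bytes).expect("bad request");
    let response = match req {
        Request::MatVec { layer, input } => Response::MatVec {
            layer,
            // Placeholder: the real worker multiplies its matrix shard by `input`.
            partial_output: input,
        },
    };
    bincode::serialize(&response).expect("failed to serialize response")
}
```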
Check out the demo for yourself.
Note that the initial load takes several minutes. For best results, prompt the model in English; it performs significantly better that way.
The project is open-source and available on GitHub.