The greatest democratization of artificial intelligence is not happening in the cloud. It is happening on hardware you might already own, such as gaming laptops, consumer-grade GPUs, and even some MacBooks, through a process called quantization.

I've been following the quantization space closely since mid-2024, and I have been blown away by how quickly the technology is developing. A Llama 70B model, which requires roughly 140GB of memory at 16-bit precision, can now be compressed to around 35GB with 4-bit quantization. That means a $1600 GPU can host applications that were exclusive to enterprise environments just two years ago.

Quantization does much more than give hobbyists the ability to run AI locally. It lets local AI operate while preserving user privacy, dramatically lowers the cost of inference, and brings bleeding-edge AI capability to edge devices. Understanding quantization has become a necessity for anybody deploying LLMs at any level.

The Mathematics of Shrinking Intelligence

Quantization converts neural network weights from high-precision floating point to low-precision integers, reducing the memory required to store them. At full FP32 precision, each parameter occupies 4 bytes; in an integer representation such as INT4, it occupies only half a byte.

The conversion itself is simple: divide the original value by a scale factor, add a zero point, and round to the nearest integer, i.e. q = round(w / scale + zero_point), with the reverse mapping w ≈ (q − zero_point) × scale.

The progression of precisions shows how much compression is available. Going from FP32 to FP16 cuts memory use roughly in half with almost no loss in quality. BF16, a truncated float format with the same dynamic range as FP32 and the same footprint as FP16, has become the dominant format in training workflows. Quantizing to INT8 reduces memory requirements by another 2× relative to FP16 with only small increases in perplexity (typically under 2%). The biggest gains, however, come at INT4 and below, where a 7B model shrinks from 14GB down to approximately 3.5GB.

Three aspects determine quantization quality: what gets quantized, how values are grouped, and whether the mapping between the source and destination types is symmetric or uses a zero point.

Weight quantization compresses the static model parameters; activation quantization compresses the dynamic intermediate values generated during inference. Weights are much easier to quantize because their distribution is known ahead of time. Activations are harder: language models beyond roughly 6 billion parameters develop emergent outlier features with magnitudes 100 times larger than typical values, and these outliers consume the quantization range and crowd out precision for the remaining activation values.

Granularity, or how values are grouped, also strongly affects quality. Per-tensor quantization applies a single scale factor to an entire tensor of millions of weights, which is simple but loses significant information. Per-channel quantization applies a different scale factor to each output channel, improving quality at the cost of storing more scale factors. Most modern approaches use per-group quantization, assigning a separate scale to each group of 64-128 weights.
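To make this concrete, here is a minimal NumPy sketch (illustrative only, not any library's implementation) that applies the affine mapping above with per-group granularity: each group of 64 weights gets its own scale and integer zero point, is rounded to 4-bit values, and is then dequantized so we can inspect the error.

```python
import numpy as np

def quantize_groups(weights, bits=4, group_size=64):
    """Asymmetric per-group quantization: q = round(w / scale) + zero_point."""
    w = weights.reshape(-1, group_size)
    qmax = 2**bits - 1                                    # 0..15 for 4-bit
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax                        # one scale per group
    zero_point = np.round(-w_min / scale)                 # one integer offset per group
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize_groups(q, scale, zero_point):
    """Reverse mapping: w ≈ (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)  # stand-in for one weight row
q, scale, zp = quantize_groups(w)
w_hat = dequantize_groups(q, scale, zp).reshape(-1)
print("max absolute reconstruction error:", float(np.abs(w - w_hat).max()))
```

Because the zero point is an integer, rounding before or after adding it gives the same result; real kernels also pack two 4-bit values per byte, which this sketch skips for clarity.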
GPTQ: The GPU Inference Breakthrough

When Frantar et al. from IST Austria released the GPTQ paper in October 2022, they resolved a long-standing scaling problem: previous optimal quantization methods had roughly O(d³) complexity per layer, making them intractable for billion-parameter models. GPTQ quantizes every row of a layer in the same column order, which lets it reuse computation across rows and quantize a 175B-parameter model to 4 bits in approximately 4 GPU-hours on a single A100.

GPTQ builds on Optimal Brain Quantization. It uses second-order (Hessian) information to quantize weights with minimal output error, working through each layer column by column and updating the remaining weights to compensate for the accumulated error.

Production adoption has been extensive: vLLM, TensorRT-LLM, Hugging Face Transformers, and FastChat all provide native support for GPTQ. Benchmarks show a 3.25× inference speed-up over FP16 on an A100, and the 3-bit version of OPT-175B reached a perplexity of 8.68 versus an 8.34 FP16 baseline.

GPTQ performs well when the calibration data resembles the deployment distribution. When the two differ, it can overfit to the calibration samples and distort the learned features needed for out-of-distribution inputs. I ran into this repeatedly during my testing.

AWQ: Protecting What Matters Most

The Activation-Aware Weight Quantization paper (June 2023) from researchers at MIT dramatically shifted my view of model compression. Its core finding: only 0.1-1% of weights are critical to overall model performance, and these "salient" weights should be identified from activation statistics, not weight magnitudes.

Rather than resorting to mixed-precision formats, AWQ protects these weights through an equivalent mathematical transformation that preserves hardware efficiency: salient weight channels are scaled up before quantization, reducing their relative quantization error, with the inverse scale folded into the preceding activations.

My production testing showed AWQ significantly outperforming GPTQ. It needed roughly a tenth of the calibration data (16 sequences versus 192) for similar quality, and it generalized better across domains: when the calibration and evaluation distributions differed, AWQ showed only a 0.5-0.6 perplexity increase, whereas GPTQ degraded by 2.3-4.9. AWQ was also the first method to successfully quantize multimodal models such as OpenFlamingo and LLaVA. Current benchmarks show AWQ running approximately 1.45× faster than GPTQ under the TinyChat framework, serving 13B models at 30 tokens/second on a laptop RTX 4070.
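The intuition behind that scaling trick is easy to demonstrate. Below is a toy NumPy sketch, not the reference AWQ implementation: input channels with large activations are scaled up before per-channel weight quantization, the inverse scale is folded back into the weights, and a small grid search over a single scaling exponent (loosely mirroring the paper's search) picks the setting that minimizes output error. All sizes and values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 128, 256
W = rng.normal(size=(d_out, d_in)).astype(np.float32)
x = rng.normal(size=d_in).astype(np.float32)
x[:8] *= 10.0                                   # a few "salient" input channels

def int4_roundtrip(w):
    """Symmetric per-output-channel INT4 quantize/dequantize (levels -8..7)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def output_mse(alpha):
    """AWQ-style per-channel scaling s_j = (|x_j| / mean|x|) ** alpha."""
    s = (np.abs(x) / np.abs(x).mean()) ** alpha
    w_hat = int4_roundtrip(W * s) / s           # scale up, quantize, fold scale back
    return float(np.mean((W @ x - w_hat @ x) ** 2))

errors = {a: output_mse(a) for a in np.round(np.linspace(0.0, 1.0, 11), 1)}
best = min(errors, key=errors.get)
print(f"no protection (alpha=0.0): output MSE {errors[0.0]:.4f}")
print(f"best alpha={best:.1f}:          output MSE {errors[best]:.4f}")
```

The salient channels dominate the layer's output, so shrinking their quantization error (at a slight cost to unimportant channels) reduces the overall output error, all while the stored weights remain plain INT4.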
GGUF: Models for Everyday Hardware

Started in March 2023 after Meta's LLaMA release, Georgi Gerganov's llama.cpp project pursued a different objective: running LLMs on hardware without a dedicated GPU. The GGUF format (GPT-Generated Unified Format), introduced in August 2023 as a replacement for the original GGML format, stores quantized weights together with all metadata, tokenizer, architecture, and context length, in a single self-describing file.

Compared with other formats, GGUF's quantization philosophy favors simplicity and broad hardware support over maximum compression efficiency. Its K-quant variants (Q4_K_M, Q5_K_S, etc.) organize weights into nested "super-blocks" of sub-blocks and quantize the sub-block scale factors themselves for additional compression.

The resulting ecosystem is impressive. llama.cpp supports CPU inference using AVX/AVX2/AVX512 on x86 and NEON on ARM; GPU acceleration through CUDA, Metal, HIP, Vulkan, and SYCL; and hybrid execution that offloads layers between CPU and GPU when VRAM is limited. A 70B model can run on a 24GB RTX 4090 with CPU offloading at 8-15 tokens/second. The project had gained 82,000 GitHub stars by June 2025, and GGUF powers most of today's consumer LLM tools (Ollama, LM Studio, GPT4All).

A major practical advantage of GGUF is how quickly quantized models can be produced: GPTQ and AWQ require hours of calibration, while creating a GGUF quantization takes minutes. Q4_K_M has become the community's most recommended variant, achieving perplexities within 0.05 of FP16 at approximately 4.5 bits per weight.
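Running such a file locally takes only a few lines of Python. The sketch below assumes the llama-cpp-python bindings and a Q4_K_M GGUF file on disk (the file name is illustrative); n_gpu_layers controls the hybrid CPU/GPU offloading described above.

```python
# pip install llama-cpp-python   (build with CUDA/Metal support to enable GPU offloading)
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b.Q4_K_M.gguf",  # illustrative path to a local GGUF file
    n_gpu_layers=35,                       # layers offloaded to the GPU; -1 = all, 0 = CPU only
    n_ctx=4096,                            # context window to allocate
)

out = llm("Explain quantization in one sentence:", max_tokens=64, temperature=0.7)
print(out["choices"][0]["text"])
```

Raising or lowering n_gpu_layers is the knob that lets a model larger than your VRAM still run, trading tokens per second for fit.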
Performance Reality Across Consumer Hardware

Specific benchmarks make the practical trade-offs clear. On a Llama 2-13B model at 4-bit on an RTX 3090, GPTQ via ExLlamav2 reached 64 tokens/second; GGUF Q4_K_M reached 31 tokens/second (roughly half of GPTQ) but with much stronger CPU fallback; AWQ sat between them at 41 tokens/second while best preserving quality on instruction-following tasks.

Memory requirements follow the expected trend: a 70B model needs approximately 148GB at FP16, 70-80GB at INT8, and 35-45GB at INT4. The RTX 4090's 24GB of VRAM handles any 7B-13B model without issue and 30B-34B models at Q4-Q5; it can also run a 70B model, but only with some of the computation offloaded to the CPU.

Quality loss varies significantly by task type, which in many cases I did not anticipate. Mathematical reasoning is surprisingly robust at Q4, retaining more than 87% of accuracy. Instruction-following degrades the most (up to 20%), followed by multilingual capabilities (15-20%).

Apple Silicon has emerged as a serious platform for local inference. An M3 Max with 64GB of unified memory runs Llama 3 8B at 4-bit at 65 tokens/second via MLX, competitive with discrete GPUs, and an M4 Ultra with 192GB of unified memory can hold 70B models at FP16 or 400B+ models at 4-bit.

The Research Frontier

Advances from 2024 and 2025 point to where quantization is headed. Mobius Labs' Half-Quadratic Quantization (HQQ) uses robust optimization to eliminate calibration altogether, quantizing the 70-billion-parameter Llama-2 in about 5 minutes, roughly 50 times faster than GPTQ. Yandex's AQLM applies additive quantization, representing weight groups as sums of codebook entries to reach extreme compression ratios.

Microsoft's most ambitious work explores native 1-bit training. BitNet b1.58 constrains every weight to -1, 0, or +1 throughout training and exploits that constraint to eliminate multiplications, reducing matrix products to simple additions. The bitnet.cpp framework has demonstrated 1.37-6.17× the speed of FP16 on CPU-based systems, and BitNet b1.58-2B-4T, published in early 2025, was the first open-weights native 1-bit model to match the performance of full-precision 2B models. Because such models require a new training process and cannot be produced by converting an existing model after training, they point toward a future in which there is no distinction between a model's training precision and its inference precision.

Conclusion

Quantization of LLMs went from an academic optimization strategy to the foundation of local AI. Going from models that demand 140GB of memory to models that fit in a few gigabytes is not just about compression; it changes who can deploy and use powerful language models. GPTQ proved that quantization could work on GPUs, AWQ showed that protecting 1% of the weights preserves 99% of the capability, and GGUF brought inference to any device with a CPU. When 1-bit models perform as well as their full-precision counterparts, the economics of deploying AI will change completely. The "art" of quantization is becoming the "science" of fitting intelligence into any environment.