Last year, I spent three weeks comparing language models as part of a team's development of a customer service pipeline. The task was clear-cut: identify the best-performing model within the team's hardware limitations. I began by evaluating the largest models I believed we could realistically deploy. What I discovered fundamentally changed how I think about model selection: a fine-tuned 3-billion-parameter model outperformed our baseline 70B model on every measure that mattered to the team, and not by a small margin.

This wasn't supposed to happen. For years, the field of artificial intelligence has run on an almost embarrassingly simple dogma: bigger models perform better, and more parameters mean more capability. Yet here I was, watching a model that ran comfortably on a gaming laptop outperform one that required enterprise-class infrastructure. It turns out I wasn't alone in this discovery. The entire AI industry has been quietly accumulating evidence that the "larger is better" era has ended, or will at least be far less black-and-white going forward than we once assumed.

The Numbers That Made Me Rethink Everything

When Microsoft released Phi-4 in late 2024, something remarkable happened: it averaged a score of 91.8 on the November 2024 AMC-10/12 math competitions, problems written after its training data was collected, outperforming GPT-4o and Gemini Pro 1.5. Let that sink in for a moment: a 14-billion-parameter model, a fraction of the estimated size of those frontier systems, solving brand-new mathematical problems better than the industry's flagship offerings.

The evidence keeps accumulating across other benchmarks as well. Alibaba's Qwen2.5-3B scored 79.1% on the GSM8K math benchmark; for perspective, the similarly sized Gemma 2 2B (2.6 billion parameters) managed only 30.3%.

Having followed these comparisons for many months now, I find the trend clear: 3B-class models are not merely competitive in mathematical reasoning, they often beat their larger counterparts. And mathematics is not an edge case; it is exactly the kind of structured reasoning we long believed required massive scale.

Why Data Quality Trumps Model Size

The breakthrough insight came from Microsoft's research team, and it's counterintuitive enough that I initially dismissed it. Their claim: a 1.3-billion-parameter model trained on roughly 7 billion carefully curated tokens can outperform models ten times its size trained on trillions of unfiltered web tokens. Microsoft found that training on "synthetic textbook-quality" data, generated by another model, produced better results than training on raw web-scale data. In essence, instead of letting the model feed on whatever it finds on the internet, you feed it a curated, educationally rich diet.

This is exactly what the Phi series does. Phi-3-mini, with 3.8 billion parameters, scored 68.8% on MMLU and 8.38 on MT-Bench conversational quality, putting it in the same range as Mixtral 8x7B, a model with roughly twelve times as many parameters. With Phi-4, Microsoft went further still: 9.8 trillion training tokens, all carefully filtered. Phi-4 scored 56.1% on GPQA, a graduate-level science benchmark, beating GPT-4o-mini (40.9%) and even Llama 3.3 70B (49.1%).

What goes into the training data matters: high-quality synthetic textbooks that teach STEM reasoning, heavily filtered web documents, academic books, and curated Q&A datasets all contribute to performance. Just as important is what gets filtered out. Data that consumes model capacity without improving reasoning, such as trivia and redundant text, degrades performance per parameter. Researchers describe such data as "capacity-consuming," and removing it has a significant impact on model quality.
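To make that filtering idea concrete, here is a minimal sketch of a corpus-quality filter. The scoring function is a toy stand-in for the model-based "educational value" classifiers described in the Phi papers, kept dependency-free so the example runs as written; the threshold, heuristics, and field names are my assumptions, not Microsoft's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source: str

def quality_score(doc: Document) -> float:
    """Toy stand-in for a learned quality classifier.

    Real curation pipelines score documents for educational value with a
    model; here we approximate with crude heuristics so the sketch runs
    without external dependencies.
    """
    text = doc.text
    if len(text) < 100:                              # too short to teach anything
        return 0.0
    words = text.split()
    unique_ratio = len(set(words)) / len(words)      # penalize repetitive boilerplate
    has_reasoning_cues = any(k in text.lower()
                             for k in ("therefore", "because", "step", "proof"))
    return 0.5 * unique_ratio + 0.5 * float(has_reasoning_cues)

def filter_corpus(docs: list[Document], threshold: float = 0.6) -> list[Document]:
    """Keep only documents above the quality threshold; 'capacity-consuming'
    trivia and low-quality text are dropped rather than diluted."""
    return [d for d in docs if quality_score(d) >= threshold]

# Usage: one textbook-style document survives, the spam does not.
good = Document(
    "Proof sketch: we first bound the error term, then apply induction. "
    "Because each step halves the remaining interval, the total work is "
    "logarithmic; therefore the algorithm terminates in O(log n) steps.",
    "synthetic-textbook")
bad = Document("Click here to win a free prize!!! Click here now!!!", "web-spam")
print([d.source for d in filter_corpus([good, bad])])   # ['synthetic-textbook']
```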
The Architecture Innovations That Made This Possible

Understanding why small models can now compete requires diving into the architectural techniques that squeeze maximum capability from each parameter. I'll walk through the most significant ones, since they're essential context for anyone deploying these systems.

The first is Grouped Query Attention (GQA), used in Llama 3.2. Instead of giving every attention head its own key/value (KV) pair, GQA groups the query heads so that, for example, 32 query heads share 8 KV heads. That cuts the KV cache, and the memory bandwidth it consumes during inference, by roughly 4x compared with standard multi-head attention, and memory bandwidth is typically the limiting factor in deployment.

Next is Sliding Window Attention, used prominently in Mistral 7B. Each attention layer is limited to the most recent 4,096 tokens rather than the full sequence. Although no single layer sees the full context, information still propagates across layers: stacked over 32 layers, the model can theoretically reach about 131K tokens, while running roughly 2x faster than vanilla attention on 16K sequences and using about 50% less cache memory.

Google's Gemma 2 introduced Interleaved Local-Global Attention, alternating layers that use a 4,096-token local window with layers that attend globally across the 8,192-token context. This cuts KV-cache memory by about 60% while preserving full context awareness in the parts of the sequence where it matters.

Then there are Rotary Position Embeddings (RoPE), which have replaced absolute positional encodings in nearly every small model shipping today. RoPE encodes position directly into the attention mechanism, which lets models extrapolate to longer sequences than they were trained on. Llama 3.2 uses a RoPE base of 500,000 and supports a 128K context length with only 1-3 billion parameters.

Probably the most fascinating of the newer architectures is Mixture of Experts (MoE). Microsoft's Phi-3.5-MoE contains 42 billion total parameters but activates only 6.6 billion per token, using top-2 routing among 16 experts. You get the knowledge capacity of a large model at the inference cost of a much smaller one. Mixtral 8x7B reached the performance of Llama 2 70B while using only 13 billion active parameters, delivering roughly 6x faster inference at the same quality level.
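To make the routing mechanics concrete, here is a minimal PyTorch sketch of a sparse layer with top-2 expert routing. The dimensions, expert design, and router are illustrative only; this is not Phi-3.5-MoE's or Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Sparse mixture-of-experts layer with top-2 routing.

    Only 2 of n_experts feed-forward blocks run per token, so per-token
    compute stays close to a much smaller dense model, while the total
    parameter count (and thus stored knowledge) remains large.
    """
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                       # (tokens, n_experts)
        weights, idx = logits.topk(2, dim=-1)         # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)          # normalize the two routing weights
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])  # only selected experts do any work
        return out

# Usage: route 4 tokens through the layer.
layer = Top2MoE()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)   # torch.Size([4, 512])
```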
The Economic Case Is Overwhelming

Here's where the practical implications become undeniable. As Andreessen Horowitz has noted, reaching an MMLU score of 42 cost about $60 per million tokens with GPT-3 in November 2021; by 2024, Llama 3.2 3B delivered similar results for $0.06 per million tokens. That is a roughly 1,000x improvement in cost-effectiveness in three years.

Late 2024 API pricing told the same story. For sustained enterprise workloads, Dell Technologies research puts the cost-effectiveness of deploying a small model on premises at 2.1x to 4.1x better than making cloud API calls. For a customer-support workload at scale, routing every query through the GPT-4 API can cost roughly 225x more than serving a self-hosted 7B model.

DeepMind's Chinchilla scaling laws showed that compute-optimal training uses roughly 20 tokens per parameter, but many recent models push far beyond that ratio. TinyLlama trained its 1.1B model on 3 trillion tokens, a roughly 2,700:1 ratio, and Qwen3-0.6B was trained at closer to 60,000:1. This deliberate "over-training" yields smaller models that outperform larger, under-trained ones, and it changes the entire optimization calculus.

Edge Deployment Has Arrived

Apple Intelligence, released in October 2024, runs an approximately 3-billion-parameter model entirely on-device on the iPhone 15 Pro and later, handling writing assistance, notification summaries and Siri enhancements without cloud access. The company states that its on-device model matches GPT-3.5 Turbo and outperforms Mistral, Phi-3 and Gemma models of similar size.

The hardware needed to run a local model has fallen dramatically. Llama 3.2 1B runs in about 1.8 GB of GPU memory at 8K context, smoothly enough for modern smartphones, and Meta worked with Qualcomm and MediaTek to optimize it for mobile on launch day. Quantized versions use 4-bit weights with 8-bit activations via SpinQuant and QAT + LoRA.

This is genuine democratization. Capabilities that required large-scale commercial infrastructure just eighteen months ago now run on hardware costing a few hundred dollars.

Fine-Tuning Is Now Accessible

QLoRA (Quantized Low-Rank Adaptation) has opened model customization to people who previously couldn't afford it, cutting the memory required for fine-tuning by as much as 75-80% while retaining 80-90% of the quality of full fine-tuning.

The actual VRAM requirements demonstrate the new reality:

- Full fine-tuning of a 7B model: 60–120GB (more than $50k in hardware)
- LoRA fine-tuning of a 7B model: 16–24GB (one RTX 4090)
- QLoRA fine-tuning of a 7B model: 8–10GB (an RTX 3060 12GB)
- QLoRA fine-tuning of a 65B model: ~48GB (one A100)

A researcher with a consumer GPU can now fine-tune models that were once the exclusive domain of enterprise compute.
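As a rough illustration of how accessible this has become, here is a minimal QLoRA setup sketch using the Hugging Face transformers, bitsandbytes and peft libraries. The model name, LoRA rank, and target modules are placeholder choices for illustration, not a recommended recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Hypothetical base-model choice; any small causal LM with the usual
# q_proj/v_proj attention projections would work similarly.
model_id = "meta-llama/Llama-3.2-3B"

# 4-bit NF4 quantization: this is what shrinks the memory footprint enough
# to fit fine-tuning on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters: only these small low-rank matrices are trained; the
# quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of total parameters
```

From here, the adapted model can be trained with any standard causal-LM training loop or trainer; the point is that the whole setup fits on a single consumer GPU.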
The open-source ecosystem has responded in kind: Microsoft's Phi models ship under the MIT License, Mistral 7B was released under Apache 2.0, and Meta's Llama 3.2 carries a commercially usable community license. Innovation in AI is no longer gated on access to large-scale compute; the barriers to entry have dropped by orders of magnitude.

The Regulatory Tailwind

GDPR's data-minimization and purpose-limitation requirements make cloud-based AI increasingly complicated for European enterprises. Small on-premises models offer clear compliance advantages:

- Data never leaves the enterprise boundary.
- Audit trails remain complete and internal.
- Third-party processor risk is eliminated.

Healthcare organizations face HIPAA constraints that often prohibit sending patient data to external APIs, and similar restrictions apply in finance under GLBA and PSD2. On top of existing regulation, the EU Artificial Intelligence Act introduces a risk-classification system whose obligations for high-risk applications, including human oversight and transparency, are far easier to demonstrate with smaller, auditable models.

There is also a growing sustainability argument. Google reports that a median Gemini text prompt uses 0.24 watt-hours of energy and emits roughly 0.03 grams of CO2e. At scale, though, the footprint adds up: serving GPT-3 has been estimated to produce roughly 8.4 tons of CO2 per year. Corporate carbon-footprint mandates are pushing organizations toward more efficient models.

Where Small Models Still Fall Short

I want to be as clear as possible about the limitations; overstating the abilities of small models benefits no one. Small models are generally weaker at complex multi-step reasoning that spans multiple knowledge domains. They also struggle with broad factual recall: Phi-4 answers only 3.0% of SimpleQA's factual-knowledge questions correctly, versus 20.9% for Llama 3.3 70B. And long-form, coherent content creation with sustained, high-quality narration demands both sufficient context and sufficient parameters.

The emerging rule of thumb from enterprise deployments: small models excel at well-defined, domain-specific tasks, especially when fine-tuned to compensate for their reduced base capabilities, while generic, open-ended applications with unknown or variable query types remain better served by larger models or hybrid routing approaches (a sketch of one such router appears at the end of this section).

This is not a story of small models eventually replacing all large models. It is about identifying which tool fits which role, and acknowledging that the boundary has shifted dramatically toward small models across far more application areas than previously assumed.
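To make the hybrid-routing idea concrete, here is a minimal sketch of a router that keeps well-scoped queries on a small local model and escalates open-ended ones to a larger model. The keyword heuristic and the model wrappers are purely illustrative stand-ins; production routers typically use a lightweight classifier or the small model's own confidence instead.

```python
from typing import Callable

# Placeholder model callables; in practice these would wrap a locally hosted
# fine-tuned 3B model and a larger hosted model or API.
def small_model(prompt: str) -> str:
    return f"[3B fine-tuned model] handling: {prompt[:40]}..."

def large_model(prompt: str) -> str:
    return f"[70B general model] handling: {prompt[:40]}..."

# Hypothetical in-domain vocabulary for a customer-support deployment.
IN_DOMAIN_KEYWORDS = ("refund", "invoice", "shipping", "password reset")

def route(prompt: str) -> Callable[[str], str]:
    """Send well-defined, in-domain requests to the small model and
    everything open-ended to the large one."""
    in_domain = any(k in prompt.lower() for k in IN_DOMAIN_KEYWORDS)
    short_and_specific = len(prompt.split()) < 60
    return small_model if (in_domain and short_and_specific) else large_model

# Usage: the first query stays local, the second escalates.
for query in ("How do I get a refund for order #1234?",
              "Write a strategic analysis of our Q3 churn trends."):
    print(route(query)(query))
```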
The Trajectory Is Clear

The most significant change underway in AI is not architectural but philosophical. The 2020-2023 era presumed that scale was the primary route to high performance. The 2024-2025 evidence indicates that careful curation of training data, efficient architectures and task-specific optimization can substitute for scale, and the trend only accelerated through 2025, with Qwen3 overtaking Llama as the most fine-tuned model family and distilled reasoning models like DeepSeek-R1 bringing frontier capabilities down to deployable sizes.

A well-trained 3B model can be as effective as a carelessly trained 70B model, and that fact ripples through deployment economics, environmental sustainability, privacy compliance and hardware accessibility. Models released through 2025 already exceed many of 2023's large models on most benchmarks.

The question isn't whether smaller models will dominate edge deployments; they already do. The question is how long it will take the rest of the industry to recognize that most production workloads don't need a 70B model when a properly fine-tuned 3B model delivers equivalent performance at a fraction of the cost.

For those of us building and deploying AI systems, this is both a challenge and an opportunity. The challenge is unlearning the "bigger is better" intuition that has governed AI development for so long. The opportunity is building sophisticated AI applications that are faster, cheaper, more private and more sustainable than anything possible even twelve months ago.