Fast KV Compaction Makes Long Context LLMs Practical

Written by aimodels44 | Published 2026/02/27
Tech Story Tags: ai | kv-cache-compression | fast-kv-compaction | attention-matching | long-context-llms | llm-inference-optimization | kv-cache-bottleneck | attention-based-compression

TL;DR: Fast KV Compaction via Attention Matching shows how to compress an LLM's KV cache in seconds, not hours, while preserving long-context performance.

The KV cache bottleneck and why it matters

Language models scale to long contexts, but only if you solve one problem first: the key-value cache explodes in size. When a model processes a document, it stores numerical embeddings of every token in memory. These embeddings let the model attend to relevant parts later. But storing everything comes at a cost. In deployed systems handling documents of 10,000 tokens or more, the KV cache becomes the bottleneck, not the model's computation.

The typical approach is summarization. Keep 10% of the tokens, discard the rest. The savings are immediate and the cost is delayed. Performance drops because summaries lose nuance, specific details, and temporal structure. Downstream tasks suffer.

There's another class of solutions that tries to optimize which tokens to keep. Earlier work like Cartridges shows you can train a compressor end-to-end, running the full model forward, measuring loss, backpropagating gradients, and iterating toward a better compressed cache. This produces high-quality results but takes hours per context. In production, you can't afford that. Every user request brings a new context. You need compression in seconds.

This creates a painful tradeoff: fast compression loses quality, slow compression preserves it. This paper reveals that the tradeoff is false. There's a third path.

Attention reveals what compression should preserve

Here's the insight that unlocks everything: models don't use all tokens equally. At each layer, each attention head focuses on specific parts of the input, and that focus is directly measurable. Attention weights tell you exactly which tokens the model's reasoning actually touches.

This reframes the compression problem completely. Instead of asking "what's semantically important?", ask "which tokens does the model actually attend to?" If you can build a compressed cache that makes the model attend the same way, it will behave nearly identically, even if some background tokens vanish.

This is not the same as summarization. You're not trying to preserve meaning, you're trying to preserve mechanical behavior. The model's attention patterns are its compression blueprint.

Previous work optimized this end-to-end, treating the model as a black box and slowly tweaking the compressed cache until attention outputs matched. That requires hundreds of forward passes and gradient updates. But there's a faster path: solve for the compressed cache mathematically, directly, without touching the model weights at all. Once you recognize that attention outputs are what matter, the problem decomposes into clean subproblems. Some of these admit closed-form solutions, straight algebra, no iteration needed.

This mathematical reframe is where the 50x speedup comes from.

Decomposing attention into matchable subproblems

The attention output for a single head follows a simple formula: softmax(Q @ K.T) @ V. Given a full cache and a target size, the problem is to find a subset of keys and values that will produce attention outputs as close as possible to the original.

This matching problem naturally decomposes. Each attention head becomes an independent subproblem. That independence is mathematically precious. You solve hundreds of small problems instead of one enormous one.
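To make the per-head subproblem concrete, here is a minimal NumPy sketch of single-head attention and the matching objective. The shapes and function names are mine, chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def attention_output(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def matching_error(Q, K, V, K_c, V_c):
    """How far the compressed cache's attention outputs
    drift from the full cache's outputs, for the same queries."""
    full = attention_output(Q, K, V)
    compressed = attention_output(Q, K_c, V_c)
    return np.linalg.norm(full - compressed)
```

Per head, the goal is a `(K_c, V_c)` far smaller than `(K, V)` that keeps `matching_error` near zero.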

The general approach divides into two stages. First, decide which tokens to keep. Second, optionally adjust their values to better match the original attention output.

For the first stage, several strategies emerge, each occupying different points on a speed-quality tradeoff. The simplest approach is purely greedy: select the tokens that receive the highest total attention weight across all queries. This is fast, requires no iteration, and often performs surprisingly well.
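A rough sketch of that greedy strategy, assuming a set of reference queries `Q` is available to measure attention against (helper names here are illustrative, not the paper's):

```python
import numpy as np

def softmax_weights(Q, K):
    """Attention weights of each query over each cached token."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def greedy_select(Q, K, budget):
    """Keep the `budget` tokens receiving the most total attention."""
    W = softmax_weights(Q, K)            # (num_queries, num_tokens)
    mass = W.sum(axis=0)                 # total attention per token
    keep = np.argsort(mass)[-budget:]    # top-budget token indices
    return np.sort(keep)
```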

A more sophisticated approach uses Orthogonal Matching Pursuit, an algorithm from signal processing. OMP iteratively selects tokens to minimize reconstruction error. It's slower than greedy selection but produces higher quality compressed caches. The iteration is rapid because the problem is small, just a few hundred tokens per head.
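OMP itself is a standard sparse-approximation routine from signal processing. A textbook sketch of the generic algorithm, not the paper's exact per-head variant, looks like this:

```python
import numpy as np

def omp(D, y, k):
    """Select k columns of dictionary D to approximate target y.

    Each step picks the column most correlated with the current
    residual, then re-fits coefficients over all selected columns
    by least squares (the "orthogonal" part).
    """
    selected = []
    residual = y.copy()
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        corr[selected] = -np.inf          # never re-pick a column
        selected.append(int(np.argmax(corr)))
        sub = D[:, selected]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        residual = y - sub @ coef
    return selected, coef
```

In the compaction setting, the "dictionary" is built from the cached tokens and the target is the full-cache attention output, so each iteration operates on a problem of only a few hundred dimensions.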

Between these extremes, the paper develops variants that allocate different numbers of tokens to different heads. Some heads tolerate aggressive compression without breaking performance. Others are fragile. If you compress uniformly, you waste space on robust heads and starve sensitive ones. Smarter allocation distributes tokens where they matter most.

For the second stage, once you've decided which tokens to keep, you can solve for their optimal values in closed form. This is a least-squares problem, and linear algebra provides an exact solution. You're not approximating, just solving a smaller problem more efficiently.
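A sketch of that closed-form step: once the kept keys are fixed, the optimal values fall out of one least-squares solve per head. The notation is mine and the reference queries are assumed given.

```python
import numpy as np

def softmax_weights(Q, K):
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def optimal_values(Q, K, V, keep):
    """Re-solve the kept tokens' values so that compressed attention
    outputs best match the full-cache outputs, in the least-squares
    sense. No iteration: one linear solve per head."""
    target = softmax_weights(Q, K) @ V       # full attention outputs
    W_c = softmax_weights(Q, K[keep])        # weights over kept keys
    V_c, *_ = np.linalg.lstsq(W_c, target, rcond=None)
    return V_c
```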

Head sensitivity exposes non-uniform compression opportunities

A naive greedy approach would select the highest-attention tokens globally, applying the same compression ratio across all heads. But heads are not created equal.

Testing reveals this dramatically. Fix all KV heads to a baseline compression ratio, then vary a single head's compression ratio while measuring loss change. Some heads tolerate 50x compression with minimal impact. Others break significantly when compressed to 10x. This variation across heads is substantial and consistent across models and datasets.
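The probing procedure reads naturally as a sweep. In this sketch, `eval_loss` stands in for the real pipeline of compressing the cache at the given per-head ratios and running the model; everything here is illustrative.

```python
def head_sensitivity(num_heads, ratios, eval_loss, baseline_ratio=0.05):
    """For each head, sweep its compression ratio (fraction of tokens
    kept) while all other heads stay at the baseline, and record the
    change in loss relative to the all-baseline configuration.

    eval_loss(per_head_ratios) -> scalar loss, supplied by the caller.
    """
    base = eval_loss([baseline_ratio] * num_heads)
    curves = {}
    for h in range(num_heads):
        deltas = []
        for r in ratios:
            cfg = [baseline_ratio] * num_heads
            cfg[h] = r
            deltas.append(eval_loss(cfg) - base)
        curves[h] = deltas
    return curves
```

Flat curves mark robust heads; steep curves mark fragile ones.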

Head sensitivity curves in Qwen3-4B. Fixing all KV heads to a 5% compression ratio, then varying one head's ratio reveals which heads are robust (flatter curves) and which are fragile (steeper curves). Loss change is minimal for some heads even at extreme compression.

This discovery motivates adaptive allocation. Instead of a uniform compression budget, you can solve for the optimal distribution. Give robust heads higher compression ratios, spend your token budget on sensitive heads. This is similar to how a human might summarize differently depending on subject matter: compress the boilerplate, preserve the critical details.

The optimization to find this allocation is itself a clean problem. Use the head sensitivity curves to allocate tokens proportionally, then validate that the allocation produces downstream performance close to the full cache.
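One simple way to turn sensitivity measurements into a budget, sketched under the assumption that a single scalar sensitivity per head is enough; the paper's exact allocation rule may differ.

```python
import numpy as np

def allocate_budget(sensitivity, total_tokens, min_tokens=1):
    """Split a total token budget across heads in proportion to each
    head's measured sensitivity, with a per-head floor. Assumes the
    floor times the head count does not exceed the budget."""
    s = np.asarray(sensitivity, dtype=float)
    raw = s / s.sum() * total_tokens
    alloc = np.maximum(np.floor(raw).astype(int), min_tokens)
    # hand leftover tokens to the most sensitive heads first
    leftover = total_tokens - alloc.sum()
    for h in np.argsort(-s)[:max(leftover, 0)]:
        alloc[h] += 1
    return alloc
```

Robust heads end up with few tokens (high compression); fragile heads absorb most of the budget.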

Optimized head budgets compared to global selection. The optimized allocation (right) shows similar structure to global highest-attention selection (left) but with strategic clustering where sensitive heads receive more tokens.

Why design choices matter and what drives them

Several decisions shaped the method's final form, and testing them validates that the approach is robust.

First, which queries should you use to measure attention patterns? You could use all tokens in context, or you could sample queries that match expected usage. The paper tests eight variants. Self-study-based methods, where you sample from the model's own predicted queries, perform best, especially at high compression ratios. This makes intuitive sense: you're compressing for realistic attention patterns, not theoretical ones. Other approaches like sampling from context prefill work well too. The point is that multiple reasonable choices work, suggesting the core insight is fundamental.

Reference query sampling comparison. Self-study-based methods (sampling the model's own queries) outperform alternatives, particularly at high compression. This indicates the method is stable and not dependent on lucky choices.

Second, how do you aggregate attention weights when deciding which tokens matter? You could sum across all heads, average them, take the maximum per token, or use other functions. Different aggregations highlight different tokens as important.
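These aggregation choices are one-liners over the attention-weight matrix. A small sketch, with a contrived example showing that sum and max can rank tokens differently:

```python
import numpy as np

def aggregate(W, how="sum"):
    """Collapse per-query attention weights (queries x tokens)
    into one importance score per token."""
    if how == "sum":
        return W.sum(axis=0)
    if how == "mean":
        return W.mean(axis=0)
    if how == "max":
        return W.max(axis=0)
    raise ValueError(f"unknown aggregation: {how}")
```

With `sum`, a token moderately attended by many queries can outrank a token sharply attended by one query; `max` reverses that preference.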

Effect of aggregation function on key selection. Different aggregation functions (sum, max, mean) produce different token selections, but all competitive variants maintain similar accuracy-compression tradeoffs.

These variations matter but the method is stable across them. This robustness builds confidence that the approach isn't brittle or dependent on one fortunate choice.

Validating the core insight on real benchmarks

Everything hinges on whether fast compaction actually works for downstream tasks. The paper tests on QuALITY, a reading comprehension benchmark with long documents, and LongHealth, a medical question answering dataset. Both involve real-world contexts where understanding specific information matters.

The results confirm the insight. You can compress the KV cache to 5-10% of its original size with negligible accuracy loss, in seconds, using AM-OMP or simpler greedy variants. Compare this to Cartridges, the previous state-of-the-art, which requires hundreds of seconds to achieve similar quality.

Accuracy vs. compaction time trade-off across methods on QuALITY. Notice the log scale on the x-axis. Methods cluster into two groups: fast approaches (under 10 seconds) that nearly match slow approaches, and Cartridges extending far right at hundreds of seconds.

This speedup isn't a marginal improvement. It's the difference between viable and impractical. You cannot pre-compute KV caches for every possible user query. In production, you compress on the fly, per request. Hours of optimization breaks your system. Seconds of optimization fits into the latency budget.

Across multiple models (Llama 3.1-8B and Qwen3-4B), the pattern holds. Fast methods achieve similar accuracy to slow methods at the same compression ratio.

Accuracy vs. compaction ratio across methods. AM-OMP and AM-HighestAttentionKeys match or exceed Cartridges and prior methods, while running 50x faster.

Ablation and robustness testing

To validate that each component contributes meaningfully, the paper conducts leave-one-out experiments, ablating components of AM-OMP and measuring how performance degrades.

Leave-one-out experiments on AM-OMP. Removing each component (OMP selection, value optimization, adaptive allocation) increases log-perplexity, confirming that each part of the method contributes to quality.

Removing the OMP selection stage and reverting to pure greedy selection hurts performance, especially at high compression. Removing value optimization (the closed-form adjustment stage) also degrades quality. Removing adaptive allocation (the per-head budget optimization) reduces performance. Each component carries weight. The method is not over-engineered; it's minimal.

This validates that the method is well-designed, not dependent on one lucky choice. The components work together systematically.

Reconstruction loss predicts downstream performance

A useful finding: the per-head reconstruction loss (how well compressed attention outputs match the original) correlates with final downstream task accuracy. This validates the optimization target itself.

Reconstruction loss vs. downstream accuracy. Left shows log-perplexity using the original cache as a reference. Right plots downstream multiple-choice accuracy against this loss across different compression ratios and methods. The correlation is clear: better reconstruction implies better downstream performance.

This correlation matters because it justifies the entire framing. You're optimizing for the right thing. Preserving attention outputs preserves downstream behavior. This is not an assumption, it's empirically validated.

Why this approach works where others don't

The speedup comes from two sources: choosing a better-structured problem and solving it directly rather than indirectly.

Cartridges optimizes end-to-end. Run the full model forward, measure loss on downstream tasks, backpropagate through the entire computation graph, update the compressed cache, repeat. This is indirect optimization through a complex function. It requires hundreds or thousands of iterations. Each iteration is expensive because the function is large.

Attention Matching solves the problem directly. You formulate the matching problem mathematically and solve it using algebra and efficient algorithms. For some subproblems, there's a closed-form solution requiring no iteration. For others like OMP, iteration is necessary but the problem is small enough that it completes in milliseconds. You're not tuning millions of parameters indirectly. You're solving a concrete problem of dimension a few hundred directly.

This is why the method is fast. It's not a special trick or a clever heuristic. It's the natural consequence of choosing a problem that mathematics can handle efficiently.

When compression works and when it doesn't

The method excels on tasks where the model's attention is sparse and structured. Long-document reading comprehension, information retrieval, question answering over context, these all have attention patterns you can compress. The model focuses on relevant parts and ignores irrelevant ones.

The method is less effective on tasks where every token attends to nearly every other token. If attention is dense, you can't compress meaningfully without losing information. The paper doesn't claim universal compression, just fast and high-quality compression for practical long-context workloads.

The broader impact on KV cache management

This work connects to a growing literature on efficient long-context inference. Earlier approaches like token merging and attention pruning tried to reduce computation. KV cache eviction strategies like G-KV decoding manage which tokens to keep during generation. Low-rank attention methods compress the attention computation itself.

Attention Matching sits in this landscape by focusing on compression of stored key-value representations. The key insight, that attention patterns encode compressibility, likely applies to these other approaches too. If you can measure what the model actually uses, you can optimize for it directly.

Why attention matching changes the game

For years, KV cache compression was seen as a learned problem. Train a compressor, learn what matters, optimize end-to-end through the model. This framing made it expensive because you were tuning something abstract (the compressed cache) through indirect feedback (downstream loss).

This paper reveals it's a structural problem. The model's attention is visible and measurable. By learning to compress along the grain of attention, rather than fighting it, you get both speed and quality. The compression becomes mechanical rather than learned. Mathematics handles it.

The deeper realization is that models don't consume information uniformly. They have structure. Attention is that structure, made visible. Compression should exploit structure, not fight it.

This principle likely extends beyond KV caching. Activation memory and compute have structure too. Future work might ask the same question: what is the model actually using? Once you identify that, you can compress along those lines efficiently and mathematically, not through expensive end-to-end optimization.

For practitioners deploying long-context models now, the takeaway is immediate: you can handle 5-10x longer contexts in the same memory footprint, without pre-computing compression and without significant quality loss. The speedup means you can compress per-request in production. The mathematical foundation means the approach is predictable and principled.


This is a Plain English Papers summary of a research paper called Fast KV Compaction via Attention Matching. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.


