Fast KV Compaction via Attention Matching: The Paper, Explained
A beginner-friendly guide to KV cache compaction and its application in multi-agent systems. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
Modern AI language models like GPT, Claude, and Gemini have a memory problem. As conversations get longer, they need to remember more and more, and that memory isn't free — it takes up expensive GPU space. This paper introduces a clever way to shrink that memory by up to 50x while barely losing any quality.
Here's what the paper solves:
- The memory wall — Language models store a "key-value cache" that grows linearly with conversation length. Long conversations can eat many gigabytes of GPU memory per request.
- Existing solutions are bad — Current approaches either summarize context (lossy and slow) or drop tokens (quality degrades fast at high compression).
- Gradient-based approaches are too slow — Prior work (Cartridges) could compress well but took GPU-hours per context. This paper achieves similar quality in seconds.
Background Concepts
Transformers and Attention
The Transformer is the architecture behind essentially every modern language model. It processes text by letting every word "attend to" every other word, learning which words are relevant to each other. Its core mechanism is attention: the model computes relevance scores between all pairs of positions in a sequence, allowing each position to "look at" and gather information from the most relevant other positions.
Deep dive: How attention works mathematically
Attention uses three sets of vectors for each token:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I carry?"
The model computes scores by comparing each Query against all Keys (via dot product), normalizes them with softmax (a function that converts a list of numbers into probabilities summing to 1: it exponentiates each number and divides by the total, making larger values much more prominent), and uses those scores to take a weighted average of the Values.
Before the softmax, the scores are divided by √d, where d is the key dimension. This scaling prevents the dot products from getting too large, which would push softmax into regions where its gradients are tiny.
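A minimal NumPy sketch of this computation (single head, no masking; shapes and values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores: every query dotted with every key, scaled by sqrt(d).
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores)   # each row sums to 1
    return weights @ V          # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, dimension 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query
```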
The KV Cache
When a language model generates text one token at a time, it would be wasteful to recompute attention over the entire context for every new token. Instead, models store the Key and Value vectors from all previous tokens in a KV cache. This avoids redundant computation during generation, but the cache grows linearly with sequence length.
For a model like Qwen3-4B (a 4-billion-parameter language model) with a 128K-token context window (roughly 100,000 words), the KV cache alone can reach many gigabytes per request. This is the bottleneck that compaction addresses.
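As a back-of-the-envelope check, here is a sketch using the layer and KV-head counts quoted later for Qwen3-4B; the head dimension of 128 and bf16 (2-byte) storage are assumptions for illustration:

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
# Layer and KV-head counts for Qwen3-4B come from the text; head_dim=128 and
# 2-byte (bf16) storage are assumed for illustration.
layers, kv_heads, head_dim, bytes_per = 36, 8, 128, 2

def kv_cache_bytes(seq_len):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

gib = kv_cache_bytes(128_000) / 2**30
print(f"{gib:.1f} GiB for a 128K-token context")
```

Under these assumptions, a single full-context request holds on the order of 17-18 GiB of cache, which is why "many gigabytes per request" is the right mental model.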
Token Eviction and Merging (Prior Approaches)
Before this paper, the main approaches to shrinking the KV cache were:
Token Eviction
Simply delete the KV entries for tokens deemed unimportant. Problem: at high compression ratios (20x+), performance drops sharply because you're permanently losing information.
Summarization
Replace the original text with a shorter summary. Problem: it's slow (requires a full LLM call), lossy, and the summary might not capture what's actually needed for downstream tasks.
Cartridges (Prior Work)
Cartridges (Eyuboglu et al., 2025) showed that you can build very compact KV representations in latent space (the model's internal representation space, as opposed to "token space", the raw text) by training a compact cache end-to-end with prefix-tuning on synthetic "self-study" data. But this relies on gradient-based optimization: iteratively computing gradients of a loss and adjusting parameters in the direction that reduces it, like slowly walking downhill to the bottom of a valley. It is essentially training a small model for every context, which takes 5+ GPU-hours, making it impractical for real-time use.
Grouped Query Attention (GQA)
Modern models use GQA (Grouped Query Attention) to reduce the KV cache size. Instead of each attention head having its own keys and values, multiple query heads share a single KV head, shrinking the cache by the grouping factor while preserving most of the model's expressiveness. For example, Qwen3-4B has 36 processing layers, each with 32 query heads but only 8 KV heads: a 4x reduction in memory.
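A sketch of the sharing pattern, assuming the head counts above; the sequence length and head dimension are arbitrary:

```python
import numpy as np

# GQA sketch: 32 query heads share 8 KV heads (4 query heads per KV head).
# Head counts follow the Qwen3-4B example in the text; seq=16 and head_dim=64
# are arbitrary illustration values.
n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 64, 16
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))  # only 8 KV heads are stored

# Query head h attends against KV head h // group. Repeating is only for the
# einsum below; the cache itself stays at n_kv_heads.
k_expanded = np.repeat(k, group, axis=0)  # (32, seq, head_dim)
scores = np.einsum('hqd,hkd->hqk', q, k_expanded) / np.sqrt(head_dim)
print(scores.shape)  # (32, 16, 16)
```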
Rotary Position Embeddings (RoPE)
RoPE (Rotary Position Embeddings) is how modern models encode the position of tokens in a sequence. It works by rotating the key and query vectors by an angle that depends on their position, so that the dot product between a query at position i and a key at position j naturally encodes their relative distance |i - j|.
This matters for compaction because even though we reduce the physical number of KV entries, we keep the original position IDs so that future tokens still get the correct positional relationships.
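A tiny sketch of the relative-position property, using a single 2-D rotation (real RoPE rotates many feature pairs at different frequencies; the angle scale here is arbitrary):

```python
import numpy as np

# Rotate a 2-D vector by a position-dependent angle, then check that the
# query-key dot product depends only on the relative distance i - j.
def rotate(v, pos, theta=0.1):
    a = pos * theta
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ v

q = np.array([1.0, 0.5])
k = np.array([0.3, -0.7])
d1 = rotate(q, pos=10) @ rotate(k, pos=7)    # positions 10 and 7, distance 3
d2 = rotate(q, pos=100) @ rotate(k, pos=97)  # distance 3 at different absolute positions
assert np.isclose(d1, d2)  # same relative distance, same score
```

This is exactly why keeping the original position IDs after compaction works: the score between a new query and a surviving key depends only on their positional gap, not on how many keys were physically removed in between.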
How It Works: Attention Matching
The core idea is elegant: instead of trying to optimize a compact cache end-to-end, just make the compact cache produce the same attention behavior as the original. If the attention outputs match, the model can't tell the difference.
Formally, the method finds compacted keys Ck, scalar biases β, and compacted values Cv such that, for all relevant queries q:

softmax(q Ckᵀ + β) Cv ≈ softmax(q Kᵀ) V

∑j exp(q · Ck,j + βj) ≈ ∑j exp(q · Kj)

The first equation matches the attention output (what information gets passed forward). The second matches the attention mass (how much total "weight" this block contributes when combined with other blocks).
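A toy illustration of the mass-matching condition: for a single query, biases can make a two-key compact block carry exactly the mass of the original six keys. (The real method fits β jointly over many reference queries, where exact matching is no longer possible; the numbers here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=8)          # one reference query
K = rng.normal(size=(6, 8))     # original keys

full_mass = np.exp(K @ q).sum()          # total attention mass of the block
kept = K[[2, 5]]                         # keep only 2 of the 6 keys
# Give each kept key a bias so it carries half the original mass:
# exp(k·q + beta) = full_mass / 2  =>  beta = log(full_mass / 2) - k·q.
beta = np.log(full_mass / 2) - kept @ q
compact_mass = np.exp(kept @ q + beta).sum()
assert np.isclose(compact_mass, full_mass)  # mass preserved despite dropping keys
```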
The Three-Step Pipeline
Step 1: Selecting Keys
The method restricts compact keys to be a subset of the original keys (rather than optimizing new key vectors freely). This avoids the need for gradient descent. Two selection strategies are offered:
Highest Attention Keys (Fast)
Compute attention scores for all reference queries, aggregate them (using root-mean-square averaging), and keep the top-scoring keys. Simple and fast (~3 seconds for 60K tokens).
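A sketch of this selection rule, with softmax scores aggregated by root-mean-square over the reference queries (array sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Qref = rng.normal(size=(32, 8))   # reference queries
K = rng.normal(size=(100, 8))     # original keys
t = 10                            # keep 10 of 100 (10x compaction)

# Softmax attention weights of every reference query over every key.
scores = Qref @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Aggregate per-key importance by root-mean-square over queries, keep top t.
rms = np.sqrt((weights ** 2).mean(axis=0))
kept_idx = np.sort(np.argsort(rms)[-t:])  # indices of surviving keys
print(kept_idx.shape)  # (10,)
```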
OMP Keys (Best Quality)
Orthogonal Matching Pursuit (OMP) is a greedy algorithm from signal processing: it iteratively picks the key whose column best reduces the residual error in matching attention mass, then refits all weights. Think of it as building a team one person at a time, always picking the person who fills the biggest gap. Because it directly minimizes the mass-matching error, it gives better quality than highest-attention selection, but it is slower (~100s with speedups).
Step 2: Fitting the Bias β
Here's a subtle but critical piece. When you keep only t out of T keys, the total attention mass (the sum of exponential scores) is necessarily smaller. This means the compacted block would get underweighted when concatenated with other blocks (like new user messages).
The solution: add a scalar bias βj to each kept key, which is equivalent to multiplying its contribution by exp(βj). This lets each kept key "represent the mass of many removed keys." The biases are found by solving a nonnegative least squares (NNLS) problem: find non-negative weights w that minimize ||Aw - b||². The non-negativity constraint ensures the weights (which represent exp(βj)) are physically meaningful, since you can't have negative attention mass.
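A sketch of the β fit using SciPy's `nnls`, under the simplification that each reference query contributes one mass-matching equation (sizes and scales are illustrative):

```python
import numpy as np
from scipy.optimize import nnls

# Each reference query q gives one equation:
#   sum_j exp(q . k_j) * w_j  =  full attention mass for q,  with w_j = exp(beta_j) >= 0.
# Stacking all reference queries yields a nonnegative least-squares problem.
rng = np.random.default_rng(0)
Qref = rng.normal(size=(64, 8), scale=0.3)   # reference queries
K = rng.normal(size=(40, 8), scale=0.3)      # original keys
kept = np.sort(rng.choice(40, size=8, replace=False))  # keys kept by step 1

A = np.exp(Qref @ K[kept].T)        # (64 queries, 8 kept keys)
b = np.exp(Qref @ K.T).sum(axis=1)  # full mass per query
w, residual = nnls(A, b)            # w = exp(beta), constrained >= 0
beta = np.log(np.where(w > 0, w, 1e-12))  # recover biases (guard zero weights)
print(beta.shape)  # (8,)
```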
Step 3: Fitting Values
With keys and biases fixed, finding the best values is a straightforward least squares problem. We want values Cv such that the attention-weighted output matches the original:

X Cv ≈ Y

where X is the matrix of compacted attention weights and Y is the matrix of original attention outputs. This has a direct closed-form solution, Cv = (XᵀX)⁻¹ XᵀY, so no iteration is needed.
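A sketch of the value fit (the `X` and `Y` here are random stand-ins for the actual attention-weight and output matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))    # compacted attention weights (queries x kept keys)
Y = rng.normal(size=(64, 16))   # original attention outputs (queries x head_dim)

# Closed-form least squares: Cv = (X^T X)^-1 X^T Y.
# lstsq is the numerically stable way to compute the same solution.
Cv, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(Cv.shape)  # (8, 16): one value vector per kept key
```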
Reference Queries
All three steps depend on "reference queries" Qref — example queries that represent the kinds of attention patterns the model will produce. The paper explores several strategies:
- Repeat-prefill: Prompt the model with "Repeat the previous context" and extract the query vectors it generates — fast and effective
- Self-study: Generate synthetic interactions (questions, summaries, fact extraction) about the context and collect query vectors — best quality but slower
- Context-prefill: Simply run the context through the model once and extract queries — cheapest but slightly less effective
Deep dive: On-policy queries
There's a subtle issue with compacting layers sequentially: compacting early layers changes the data flowing through the network, so later layers see different inputs than they would in the original model. This means the queries they generate don't match what the unmodified model would produce.
The solution is on-policy query extraction: compact layers in order, and for each layer, extract reference queries by running the model with all previously compacted layers already in place. This gives slight but consistent improvements.
Latent Briefing: Applying Compaction to Multi-Agent Systems
The Ramp Labs blog post takes the Attention Matching framework and applies it to a practical problem: efficiently sharing context between AI agents. In multi-agent systems where an orchestrator delegates tasks to worker models, there's a "token explosion" problem.
The Token Explosion Problem
In the Recursive Language Model (RLM) framework, a strong orchestrator model (like Claude Sonnet 4) decomposes tasks and makes repeated calls to a worker model (like Qwen3-14B) through a REPL environment. Each call, the orchestrator passes the raw document plus a targeted query. But the orchestrator has been building up a rich reasoning trajectory over many calls (hypotheses tested, passages identified, dead ends eliminated) that the worker never sees.
Three Key Modifications
Constructing the Compact Cache
This entire process is done per KV-head, per layer. For a model like Qwen3-4B (36 layers × 8 KV heads = 288 solves), this is fast because each solve is a small, straightforward math problem.
Wall-Clock Timing
On a 60K-token context with Gemma-3-12B (a 12-billion-parameter model) on a single high-end GPU:
| Stage | Method | Time |
|---|---|---|
| Query Generation | Context-prefill | 7s |
| Query Generation | Repeat-prefill | 8s |
| Query Generation | Self-study | 139s |
| Key Selection | Highest attention | 3s |
| Key Selection | OMP-fast | 104s |
| Key Selection | OMP (full) | 565s |
| β fitting | NNLS | 2.2s |
| Value fitting | Least squares | 1.8s |
The fastest variant (AM-HighestAttnKeys-fast using only repeat-prefill) takes roughly 15 seconds total — orders of magnitude faster than Cartridges' 5+ GPU-hours.
Inference & Chunked Compaction
At inference time, the compacted cache drops right into the model's normal attention mechanism. The only additions are the scalar biases β, which are supported natively by efficient attention libraries such as FlashAttention (an optimized implementation that computes attention in tiles, never materializing the full attention matrix) and FlexAttention.
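A NumPy sketch of inference with per-key biases; this mimics the additive logit hook that libraries like FlexAttention expose, but the function name and shapes here are illustrative:

```python
import numpy as np

def biased_attention(q, K, V, beta):
    # Identical to ordinary attention except each key's logit gets an
    # additive bias before the softmax.
    logits = K @ q / np.sqrt(K.shape[-1]) + beta
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(10, 8))   # compacted keys
V = rng.normal(size=(10, 8))   # compacted values
out = biased_attention(q, K, V, beta=np.zeros(10))
# With beta = 0 this reduces to ordinary attention over the kept entries.
print(out.shape)  # (8,)
```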
Chunked Compaction for Long Contexts
For very long contexts (e.g., 60K+ tokens), the method applies compaction independently to chunks of the input, then concatenates the compacted caches. Two variants exist:
- KV-based chunking (preferred): Prefill the full context once, slice out each chunk's KV states, compact independently, and merge. Preserves cross-chunk interactions from the original prefill.
- Text-based chunking: Prefill each chunk in isolation, then adjust position encodings so each chunk's keys line up with their original positions in the full document. An approximation because chunks don't see each other during prefill.
The cache retains a logical length of T (the original sequence length) even though its physical size is only t. New tokens appended after compaction receive the same position IDs they would under the full cache — the physical size is decoupled from the positional encoding.
Nonuniform Compaction
Not all attention heads are created equal. Some heads barely use their KV cache (they only attend to nearby tokens or to special tokens that absorb excess attention), while others are critical for long-range retrieval. A uniform compaction ratio across all heads wastes budget on insensitive heads.
The paper's key finding: head sensitivity rankings are stable across different inputs. This means you can measure which heads need more KV budget once per model, then reuse that allocation forever.
Deep dive: How head budgets are computed
For each KV head, hold all other heads at a baseline compaction ratio and vary this head's ratio. Plot the resulting change in model loss. The slope of this "sensitivity curve" tells you how much each head benefits from additional capacity.
Then, solve a resource allocation problem: starting from a uniform budget, iteratively swap units of KV budget from insensitive heads to sensitive heads until no swap improves the total loss. This produces a model-specific allocation that's reused across all contexts.
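A toy version of this allocation loop, with synthetic concave per-head benefit curves standing in for the measured sensitivity curves (the function names and numbers are invented for illustration):

```python
import numpy as np

def allocate(benefit, total_units, n_heads):
    # Start uniform, then repeatedly move one unit of KV budget from the head
    # where it helps least to the head where it helps most, until no swap
    # improves the total benefit.
    budget = np.full(n_heads, total_units // n_heads)
    while True:
        gain = np.array([benefit(h, budget[h] + 1) - benefit(h, budget[h])
                         for h in range(n_heads)])
        loss = np.array([benefit(h, budget[h]) - benefit(h, budget[h] - 1)
                         if budget[h] > 0 else np.inf for h in range(n_heads)])
        give, take = int(np.argmin(loss)), int(np.argmax(gain))
        if give == take or gain[take] <= loss[give]:
            return budget  # no swap strictly improves the objective
        budget[take] += 1
        budget[give] -= 1

# Synthetic sensitivity: heads with larger `sens` benefit more from extra
# budget, with diminishing returns (concave log curve).
sens = np.array([0.1, 0.1, 1.0, 2.0])
benefit = lambda h, b: sens[h] * np.log1p(max(b, 0))
budget = allocate(benefit, total_units=40, n_heads=4)
print(budget, budget.sum())  # sensitive heads end up with more units; total stays 40
```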
The ablation study shows that nonuniform head budgets are the single most important component of the full method — more important than learned biases, learned values, or self-study queries.
This finding also explains why hybrid architectures with sliding-window attention (like Gemma-3, where some layers attend only to a fixed-size window of recent tokens rather than the full context) work so well: even in full-attention models, most heads naturally specialize in local context, while only a few capture long-range dependencies.
Results
Accuracy vs. Compaction Ratio
Evaluated on QuALITY (long-document question answering, 894 questions across 50 contexts) and LongHealth (clinical record QA, 60K tokens per context):
| Method | Type | 50x Compaction Quality | Speed |
|---|---|---|---|
| AM-OMP | Attention Matching | Best (Pareto frontier) | ~10 min |
| AM-OMP-fast | Attention Matching | Near-best | ~2 min |
| AM-HighestAttnKeys | Attention Matching | Strong | ~15s |
| Cartridges | End-to-end optimization | Strong (better at 100x on some benchmarks) | ~5 GPU-hrs |
| KVzip | Token selection | Moderate | Fast |
| H2O+ / SnapKV | Token eviction | Poor at 50x | Very fast |
| Summarization | Token space | Poor (esp. LongHealth) | Slow |
Key results across three models (Qwen3-4B, Llama3.1-8B, Gemma3-12B):
- AM methods consistently achieve the best tradeoff between compaction time and quality
- At 50x compaction, AM matches or exceeds Cartridges while being 100-1000x faster
- Summarization fails catastrophically on information-dense tasks (LongHealth), performing no better than no context at all
- AM can be stacked on top of summarization for 200x total compaction with quality comparable to summarization alone
Component Analysis: What Matters Most?
The authors tested removing each component one at a time to measure its importance.
Latent Briefing Results
Ramp Labs tested Latent Briefing on LongBench v2 (126 questions, documents from 0-100K tokens), using Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker:
| Metric | Best Result | Condition |
|---|---|---|
| Accuracy vs. baseline | +3 percentage points improvement | At optimal threshold per condition |
| Median token savings (total) | 21-31% | At optimal thresholds |
| Worker token reduction | 42-57% (up to 65%) | Median across thresholds |
| Compaction overhead | ~1.7s median | Scales linearly with input length |
The Threshold Puzzle
An interesting finding: the optimal compaction threshold varies systematically:
Long Documents (32K-100K): Light Compaction
Threshold t=-1.0 (18% compaction) wins. Long documents have dispersed information — you need broad coverage. Still achieves 57% worker token savings.
Hard Questions: Aggressive Compaction
Threshold t=2.0 (79% compaction) wins by +3 percentage points over baseline. Hard questions cause the orchestrator to explore many speculative hypotheses. Aggressive compaction acts as a relevance filter, stripping noise.
Why This Paper Matters
For AI Infrastructure Engineers
KV cache memory is one of the biggest operational costs in serving language models. Every request to a long-context model like Claude or GPT-4 allocates gigabytes of GPU memory just for the cache. This paper provides a drop-in compaction primitive that reduces that footprint by up to 50x with minimal quality loss and no model changes. For companies running thousands of concurrent requests, this translates directly to serving more users on the same hardware — or using significantly cheaper hardware for the same workload.
For the Research Community
The paper challenges the assumption that high-quality cache compaction requires expensive gradient-based optimization. By decomposing the problem into closed-form subproblems (key selection, bias fitting, value reconstruction), it shows that simple linear algebra can match or beat end-to-end optimization while being orders of magnitude faster. The finding that nonuniform head budgets are the single most important component — more than learned biases or values — provides actionable insight for anyone building KV cache reduction systems.
The Bigger Picture
As AI systems move toward long-horizon use cases — multi-session conversations, agentic coding, persistent assistants — the question of how to manage context efficiently becomes critical. Today's solutions (summarization, token dropping, larger context windows) are either lossy, brittle, or expensive. Attention Matching offers a principled alternative: compress the model's internal representations while preserving their mathematical behavior. Combined with Ramp Labs' Latent Briefing work showing how this enables efficient memory sharing between agents, it points toward a future where AI systems can maintain rich context over arbitrarily long interactions without proportional cost increases.