Fast KV Compaction via Attention Matching: The Paper, Explained
A beginner-friendly guide to KV cache compaction and its application in multi-agent systems. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
Modern AI language models like GPT, Claude, and Gemini have a memory problem. As conversations get longer, they need to remember more and more, and that memory isn't free — it takes up expensive GPU space. This paper introduces a clever way to shrink that memory by up to 50x while barely losing any quality.
Here's what the paper solves:
- The memory wall — Language models store a "key-value cache" that grows linearly with conversation length. Long conversations can eat many gigabytes of GPU memory per request.
- Existing solutions are bad — Current approaches either summarize context (lossy and slow) or drop tokens (quality degrades fast at high compression).
- Gradient-based approaches are too slow — Prior work (Cartridges) could compress well but took GPU-hours per context. This paper achieves similar quality in seconds.
Background Concepts
Transformers and Attention
The Transformer is the architecture behind essentially every modern language model. It processes text by letting every word "attend to" every other word, learning which words are relevant to each other. Its core mechanism is attention: the model computes relevance scores between all pairs of positions in a sequence, allowing each position to "look at" and gather information from the most relevant other positions.
Deep dive: How attention works mathematically
Attention uses three sets of vectors for each token:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I carry?"
The model computes scores by comparing each Query against all Keys (via dot product), normalizes them with softmax (a function that converts a list of numbers into probabilities summing to 1: it exponentiates each number and divides by the total, making larger values much more prominent), and uses those scores to take a weighted average of the Values.
Before the softmax, the scores are divided by √d, where d is the key dimension. This scaling prevents the dot products from getting too large, which would push softmax into regions where its gradients are tiny.
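A minimal NumPy sketch of this computation (single head, no masking; shapes and values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores: every query dotted with every key, scaled by sqrt(d).
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores)   # each row sums to 1
    return weights @ V          # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, dimension 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query
```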
The KV Cache
When a language model generates text one token at a time, it would be wasteful to recompute attention over the entire context for every new token. Instead, models store the Key and Value vectors from all previous tokens in a KV cache. This avoids redundant computation during generation, but the cache grows linearly with sequence length.
For a model like Qwen3-4B (a 4-billion-parameter language model) with a 128K-token context window (roughly 100,000 words), the KV cache alone can reach many gigabytes per request. This is the bottleneck that compaction addresses.
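As a back-of-the-envelope check, here is a sketch using the layer and KV-head counts quoted later for Qwen3-4B; the head dimension of 128 and bf16 (2-byte) storage are assumptions for illustration:

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
# Layer and KV-head counts for Qwen3-4B come from the text; head_dim=128 and
# 2-byte (bf16) storage are assumed for illustration.
layers, kv_heads, head_dim, bytes_per = 36, 8, 128, 2

def kv_cache_bytes(seq_len):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

gib = kv_cache_bytes(128_000) / 2**30
print(f"{gib:.1f} GiB for a 128K-token context")
```

Under these assumptions, a single full-context request holds on the order of 17-18 GiB of cache, which is why "many gigabytes per request" is the right mental model.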
Token Eviction and Merging (Prior Approaches)
Before this paper, the main approaches to shrinking the KV cache were:
Token Eviction
Simply delete the KV entries for tokens deemed unimportant. Problem: at high compression ratios (20x+), performance drops sharply because you're permanently losing information.
Summarization
Replace the original text with a shorter summary. Problem: it's slow (requires a full LLM call), lossy, and the summary might not capture what's actually needed for downstream tasks.
Cartridges (Prior Work)
Cartridges (Eyuboglu et al., 2025) showed that you can build very compact KV representations in latent space (the model's internal representation space, as opposed to "token space", the raw text) by training a compact cache end-to-end with prefix-tuning on synthetic "self-study" data. But this relies on gradient-based optimization: iteratively computing gradients of a loss and adjusting parameters in the direction that reduces it, like slowly walking downhill to the bottom of a valley. It is essentially training a small model for every context, which takes 5+ GPU-hours, making it impractical for real-time use.
Grouped Query Attention (GQA)
Modern models use GQA (Grouped Query Attention) to reduce the KV cache size. Instead of each attention head having its own keys and values, multiple query heads share a single KV head, shrinking the cache by the grouping factor while preserving most of the model's expressiveness. For example, Qwen3-4B has 36 processing layers, each with 32 query heads but only 8 KV heads: a 4x reduction in memory.
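A sketch of the sharing pattern, assuming the head counts above; the sequence length and head dimension are arbitrary:

```python
import numpy as np

# GQA sketch: 32 query heads share 8 KV heads (4 query heads per KV head).
# Head counts follow the Qwen3-4B example in the text; seq=16 and head_dim=64
# are arbitrary illustration values.
n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 64, 16
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))  # only 8 KV heads are stored

# Query head h attends against KV head h // group. Repeating is only for the
# einsum below; the cache itself stays at n_kv_heads.
k_expanded = np.repeat(k, group, axis=0)  # (32, seq, head_dim)
scores = np.einsum('hqd,hkd->hqk', q, k_expanded) / np.sqrt(head_dim)
print(scores.shape)  # (32, 16, 16)
```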
Rotary Position Embeddings (RoPE)
RoPE (Rotary Position Embeddings) is how modern models encode the position of tokens in a sequence. It works by rotating the key and query vectors by an angle that depends on their position, so that the dot product between a query at position i and a key at position j naturally encodes their relative distance |i - j|.
This matters for compaction because even though we reduce the physical number of KV entries, we keep the original position IDs so that future tokens still get the correct positional relationships.
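A tiny sketch of the relative-position property, using a single 2-D rotation (real RoPE rotates many feature pairs at different frequencies; the angle scale here is arbitrary):

```python
import numpy as np

# Rotate a 2-D vector by a position-dependent angle, then check that the
# query-key dot product depends only on the relative distance i - j.
def rotate(v, pos, theta=0.1):
    a = pos * theta
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ v

q = np.array([1.0, 0.5])
k = np.array([0.3, -0.7])
d1 = rotate(q, pos=10) @ rotate(k, pos=7)    # positions 10 and 7, distance 3
d2 = rotate(q, pos=100) @ rotate(k, pos=97)  # distance 3 at different absolute positions
assert np.isclose(d1, d2)  # same relative distance, same score
```

This is exactly why keeping the original position IDs after compaction works: the score between a new query and a surviving key depends only on their positional gap, not on how many keys were physically removed in between.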
How It Works: Attention Matching
The core idea is elegant: instead of trying to optimize a compact cache end-to-end, just make the compact cache produce the same attention behavior as the original. If the attention outputs match, the model can't tell the difference.
Formally, the method finds compacted keys Ck, scalar biases β, and compacted values Cv such that, for all relevant queries q:

softmax(q Ckᵀ + β) Cv ≈ softmax(q Kᵀ) V

∑j exp(q · Ck,j + βj) ≈ ∑j exp(q · Kj)

The first equation matches the attention output (what information gets passed forward). The second matches the attention mass (how much total "weight" this block contributes when combined with other blocks).
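A toy illustration of the mass-matching condition: for a single query, biases can make a two-key compact block carry exactly the mass of the original six keys. (The real method fits β jointly over many reference queries, where exact matching is no longer possible; the numbers here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=8)          # one reference query
K = rng.normal(size=(6, 8))     # original keys

full_mass = np.exp(K @ q).sum()          # total attention mass of the block
kept = K[[2, 5]]                         # keep only 2 of the 6 keys
# Give each kept key a bias so it carries half the original mass:
# exp(k·q + beta) = full_mass / 2  =>  beta = log(full_mass / 2) - k·q.
beta = np.log(full_mass / 2) - kept @ q
compact_mass = np.exp(kept @ q + beta).sum()
assert np.isclose(compact_mass, full_mass)  # mass preserved despite dropping keys
```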
The Three-Step Pipeline
Step 1: Selecting Keys
The method restricts compact keys to be a subset of the original keys (rather than optimizing new key vectors freely). This avoids the need for gradient descent. Two selection strategies are offered:
Highest Attention Keys (Fast)
Compute attention scores for all reference queries, aggregate them (using root-mean-square averaging), and keep the top-scoring keys. Simple and fast (~3 seconds for 60K tokens).
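A sketch of this selection rule, with softmax scores aggregated by root-mean-square over the reference queries (array sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Qref = rng.normal(size=(32, 8))   # reference queries
K = rng.normal(size=(100, 8))     # original keys
t = 10                            # keep 10 of 100 (10x compaction)

# Softmax attention weights of every reference query over every key.
scores = Qref @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Aggregate per-key importance by root-mean-square over queries, keep top t.
rms = np.sqrt((weights ** 2).mean(axis=0))
kept_idx = np.sort(np.argsort(rms)[-t:])  # indices of surviving keys
print(kept_idx.shape)  # (10,)
```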
OMP Keys (Best Quality)
Orthogonal Matching Pursuit (OMP) is a greedy algorithm from signal processing: it iteratively picks the key whose column best reduces the residual error in matching attention mass, then refits all weights. Think of it as building a team one person at a time, always picking the person who fills the biggest gap. Because it directly minimizes the mass-matching error, it gives better quality than highest-attention selection, but it is slower (~100s with speedups).
Step 2: Fitting the Bias β
Here's a subtle but critical piece. When you keep only t out of T keys, the total attention mass (the sum of exponential scores) is necessarily smaller. This means the compacted block would get underweighted when concatenated with other blocks (like new user messages).
The solution: add a scalar bias βj to each kept key, which is equivalent to multiplying its contribution by exp(βj). This lets each kept key "represent the mass of many removed keys." The biases are found by solving a nonnegative least squares (NNLS) problem: find non-negative weights w that minimize ||Aw - b||². The non-negativity constraint ensures the weights (which represent exp(βj)) are physically meaningful, since you can't have negative attention mass.
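A sketch of the β fit using SciPy's `nnls`, under the simplification that each reference query contributes one mass-matching equation (sizes and scales are illustrative):

```python
import numpy as np
from scipy.optimize import nnls

# Each reference query q gives one equation:
#   sum_j exp(q . k_j) * w_j  =  full attention mass for q,  with w_j = exp(beta_j) >= 0.
# Stacking all reference queries yields a nonnegative least-squares problem.
rng = np.random.default_rng(0)
Qref = rng.normal(size=(64, 8), scale=0.3)   # reference queries
K = rng.normal(size=(40, 8), scale=0.3)      # original keys
kept = np.sort(rng.choice(40, size=8, replace=False))  # keys kept by step 1

A = np.exp(Qref @ K[kept].T)        # (64 queries, 8 kept keys)
b = np.exp(Qref @ K.T).sum(axis=1)  # full mass per query
w, residual = nnls(A, b)            # w = exp(beta), constrained >= 0
beta = np.log(np.where(w > 0, w, 1e-12))  # recover biases (guard zero weights)
print(beta.shape)  # (8,)
```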
Step 3: Fitting Values
With keys and biases fixed, finding the best values is a straightforward least squares problem. We want values Cv such that the attention-weighted output matches the original:

X Cv ≈ Y

where X is the matrix of compacted attention weights and Y is the matrix of original attention outputs. This has a direct closed-form solution, Cv = (XᵀX)⁻¹ XᵀY, so no iteration is needed.
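A sketch of the value fit (the `X` and `Y` here are random stand-ins for the actual attention-weight and output matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))    # compacted attention weights (queries x kept keys)
Y = rng.normal(size=(64, 16))   # original attention outputs (queries x head_dim)

# Closed-form least squares: Cv = (X^T X)^-1 X^T Y.
# lstsq is the numerically stable way to compute the same solution.
Cv, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(Cv.shape)  # (8, 16): one value vector per kept key
```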
Reference Queries
All three steps depend on "reference queries" Qref — example queries that represent the kinds of attention patterns the model will produce. The paper explores several strategies:
- Repeat-prefill: Prompt the model with "Repeat the previous context" and extract the query vectors it generates — fast and effective
- Self-study: Generate synthetic interactions (questions, summaries, fact extraction) about the context and collect query vectors — best quality but slower
- Context-prefill: Simply run the context through the model once and extract queries — cheapest but slightly less effective
Deep dive: On-policy queries
There's a subtle issue with compacting layers sequentially: compacting early layers changes the data flowing through the network, so later layers see different inputs than they would in the original model. This means the queries they generate don't match what the unmodified model would produce.
The solution is on-policy query extraction: compact layers in order, and for each layer, extract reference queries by running the model with all previously compacted layers already in place. This gives slight but consistent improvements.
Latent Briefing: Applying Compaction to Multi-Agent Systems
The Ramp Labs blog post takes the Attention Matching framework and applies it to a practical problem: efficiently sharing context between AI agents. In multi-agent systems where an orchestrator delegates tasks to worker models, there's a "token explosion" problem.
The Token Explosion Problem
In the Recursive Language Model (RLM) framework, a strong orchestrator model (like Claude Sonnet 4) decomposes tasks and makes repeated calls to a worker model (like Qwen3-14B) through a REPL environment. Each call, the orchestrator passes the raw document plus a targeted query. But the orchestrator has been building up a rich reasoning trajectory over many calls (hypotheses tested, passages identified, dead ends eliminated) that the worker never sees.
Three Key Modifications
Constructing the Compact Cache
This entire process is done per KV-head, per layer. For a model like Qwen3-4B (36 layers × 8 KV heads = 288 solves), this is fast because each solve is a small, straightforward math problem.
Wall-Clock Timing
On a 60K-token context with Gemma-3-12B (a 12-billion-parameter model) on a single high-end GPU:
| Stage | Method | Time |
|---|---|---|
| Query Generation | Context-prefill | 7s |
| Query Generation | Repeat-prefill | 8s |
| Query Generation | Self-study | 139s |
| Key Selection | Highest attention | 3s |
| Key Selection | OMP-fast | 104s |
| Key Selection | OMP (full) | 565s |
| β fitting | NNLS | 2.2s |
| Value fitting | Least squares | 1.8s |
The fastest variant (AM-HighestAttnKeys-fast using only repeat-prefill) takes roughly 15 seconds total — orders of magnitude faster than Cartridges' 5+ GPU-hours.
Inference & Chunked Compaction
At inference time, the compacted cache drops right into the model's normal attention mechanism. The only additions are the scalar biases β, which are supported natively by efficient attention libraries such as FlashAttention (an optimized implementation that computes attention in tiles, never materializing the full attention matrix) and FlexAttention.
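A NumPy sketch of inference with per-key biases; this mimics the additive logit hook that libraries like FlexAttention expose, but the function name and shapes here are illustrative:

```python
import numpy as np

def biased_attention(q, K, V, beta):
    # Identical to ordinary attention except each key's logit gets an
    # additive bias before the softmax.
    logits = K @ q / np.sqrt(K.shape[-1]) + beta
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(10, 8))   # compacted keys
V = rng.normal(size=(10, 8))   # compacted values
out = biased_attention(q, K, V, beta=np.zeros(10))
# With beta = 0 this reduces to ordinary attention over the kept entries.
print(out.shape)  # (8,)
```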
Chunked Compaction for Long Contexts
For very long contexts (e.g., 60K+ tokens), the method applies compaction independently to chunks of the input, then concatenates the compacted caches. Two variants exist:
- KV-based chunking (preferred): Prefill the full context once, slice out each chunk's KV states, compact independently, and merge. Preserves cross-chunk interactions from the original prefill.
- Text-based chunking: Prefill each chunk in isolation, then adjust position encodings so each chunk's keys line up with their original positions in the full document. An approximation because chunks don't see each other during prefill.
The cache retains a logical length of T (the original sequence length) even though its physical size is only t. New tokens appended after compaction receive the same position IDs they would under the full cache — the physical size is decoupled from the positional encoding.
Nonuniform Compaction
Not all attention heads are created equal. Some heads barely use their KV cache (they only attend to nearby tokens or to special tokens that absorb excess attention), while others are critical for long-range retrieval. A uniform compaction ratio across all heads wastes budget on insensitive heads.
The paper's key finding: head sensitivity rankings are stable across different inputs. This means you can measure which heads need more KV budget once per model, then reuse that allocation forever.
Deep dive: How head budgets are computed
For each KV head, hold all other heads at a baseline compaction ratio and vary this head's ratio. Plot the resulting change in model loss. The slope of this "sensitivity curve" tells you how much each head benefits from additional capacity.
Then, solve a resource allocation problem: starting from a uniform budget, iteratively swap units of KV budget from insensitive heads to sensitive heads until no swap improves the total loss. This produces a model-specific allocation that's reused across all contexts.
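A toy version of this allocation loop, with synthetic concave per-head benefit curves standing in for the measured sensitivity curves (the function names and numbers are invented for illustration):

```python
import numpy as np

def allocate(benefit, total_units, n_heads):
    # Start uniform, then repeatedly move one unit of KV budget from the head
    # where it helps least to the head where it helps most, until no swap
    # improves the total benefit.
    budget = np.full(n_heads, total_units // n_heads)
    while True:
        gain = np.array([benefit(h, budget[h] + 1) - benefit(h, budget[h])
                         for h in range(n_heads)])
        loss = np.array([benefit(h, budget[h]) - benefit(h, budget[h] - 1)
                         if budget[h] > 0 else np.inf for h in range(n_heads)])
        give, take = int(np.argmin(loss)), int(np.argmax(gain))
        if give == take or gain[take] <= loss[give]:
            return budget  # no swap strictly improves the objective
        budget[take] += 1
        budget[give] -= 1

# Synthetic sensitivity: heads with larger `sens` benefit more from extra
# budget, with diminishing returns (concave log curve).
sens = np.array([0.1, 0.1, 1.0, 2.0])
benefit = lambda h, b: sens[h] * np.log1p(max(b, 0))
budget = allocate(benefit, total_units=40, n_heads=4)
print(budget, budget.sum())  # sensitive heads end up with more units; total stays 40
```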
The ablation study shows that nonuniform head budgets are the single most important component of the full method — more important than learned biases, learned values, or self-study queries.
This finding also explains why hybrid architectures with sliding-window attention (like Gemma-3, where some layers attend only to a fixed-size window of recent tokens rather than the full context) work so well: even in full-attention models, most heads naturally specialize in local context, while only a few capture long-range dependencies.
Results
Accuracy vs. Compaction Ratio
Evaluated on QuALITY (long-document question answering, 894 questions across 50 contexts) and LongHealth (clinical record QA, 60K tokens per context):
| Method | Type | 50x Compaction Quality | Speed |
|---|---|---|---|
| AM-OMP | Attention Matching | Best (Pareto frontier) | ~10 min |
| AM-OMP-fast | Attention Matching | Near-best | ~2 min |
| AM-HighestAttnKeys | Attention Matching | Strong | ~15s |
| Cartridges | End-to-end optimization | Strong (better at 100x on some benchmarks) | ~5 GPU-hrs |
| KVzip | Token selection | Moderate | Fast |
| H2O+ / SnapKV | Token eviction | Poor at 50x | Very fast |
| Summarization | Token space | Poor (esp. LongHealth) | Slow |
Key results across three models (Qwen3-4B, Llama3.1-8B, Gemma3-12B):
- AM methods consistently achieve the best tradeoff between compaction time and quality
- At 50x compaction, AM matches or exceeds Cartridges while being 100-1000x faster
- Summarization fails catastrophically on information-dense tasks (LongHealth), performing no better than no context at all
- AM can be stacked on top of summarization for 200x total compaction with quality comparable to summarization alone
Component Analysis: What Matters Most?
The authors tested removing each component one at a time to measure its importance.
Latent Briefing Results
Ramp Labs tested Latent Briefing on LongBench v2 (126 questions, documents from 0-100K tokens), using Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker:
| Metric | Best Result | Condition |
|---|---|---|
| Accuracy vs. baseline | +3 percentage points improvement | At optimal threshold per condition |
| Median token savings (total) | 21-31% | At optimal thresholds |
| Worker token reduction | 42-57% (up to 65%) | Median across thresholds |
| Compaction overhead | ~1.7s median | Scales linearly with input length |
The Threshold Puzzle
An interesting finding: the optimal compaction threshold varies systematically:
Long Documents (32K-100K): Light Compaction
Threshold t=-1.0 (18% compaction) wins. Long documents have dispersed information — you need broad coverage. Still achieves 57% worker token savings.
Hard Questions: Aggressive Compaction
Threshold t=2.0 (79% compaction) wins by +3 percentage points over baseline. Hard questions cause the orchestrator to explore many speculative hypotheses. Aggressive compaction acts as a relevance filter, stripping noise.
Why This Paper Matters
For AI Infrastructure Engineers
KV cache memory is one of the biggest operational costs in serving language models. Every request to a long-context model like Claude or GPT-4 allocates gigabytes of GPU memory just for the cache. This paper provides a drop-in compaction primitive that reduces that footprint by up to 50x with minimal quality loss and no model changes. For companies running thousands of concurrent requests, this translates directly to serving more users on the same hardware — or using significantly cheaper hardware for the same workload.
For the Research Community
The paper challenges the assumption that high-quality cache compaction requires expensive gradient-based optimization. By decomposing the problem into closed-form subproblems (key selection, bias fitting, value reconstruction), it shows that simple linear algebra can match or beat end-to-end optimization while being orders of magnitude faster. The finding that nonuniform head budgets are the single most important component — more than learned biases or values — provides actionable insight for anyone building KV cache reduction systems.
The Bigger Picture
As AI systems move toward long-horizon use cases — multi-session conversations, agentic coding, persistent assistants — the question of how to manage context efficiently becomes critical. Today's solutions (summarization, token dropping, larger context windows) are either lossy, brittle, or expensive. Attention Matching offers a principled alternative: compress the model's internal representations while preserving their mathematical behavior. Combined with Ramp Labs' Latent Briefing work showing how this enables efficient memory sharing between agents, it points toward a future where AI systems can maintain rich context over arbitrarily long interactions without proportional cost increases.