How Much Do Language Models Memorize? The Paper, Explained

A beginner-friendly guide to measuring what language models actually remember. Every AI term is defined. Every concept is grounded in analogy.

Paper by John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar (FAIR at Meta, Google DeepMind, Cornell, NVIDIA, 2025) · Explainer published April 2026

Figure: Language models memorize ~3.6 bits per parameter, then switch from memorizing to generalizing.

The Big Picture

When you train a language model (a program that learns to predict the next word in a sequence, trained on massive amounts of text; GPT, Llama, and Claude are all language models) on billions of sentences, does it actually remember specific training examples? Or does it just learn general patterns? This paper proposes a rigorous way to answer that question — and the answer reveals a fundamental constant of how neural networks store information.

Here's what the researchers found:

  1. A new way to measure memorization. They define memorization in terms of compression (representing data in fewer bits): how many fewer bits does it take to describe a training example when you have the model available? If a model helps you compress a specific datapoint more, it has memorized more about that datapoint. This cleanly separates what the model memorized from what it learned generally.
  2. A universal capacity constant. GPT-style transformer models (the neural network architecture behind most modern language models, built on an attention mechanism that lets every word consider every other word) can store approximately 3.6 bits of information per parameter. A 1-billion-parameter model can memorize about 3.6 billion bits (roughly 430 megabytes) of data.
  3. Capacity fills, then generalization begins. Models first memorize training data until their capacity is full. Once capacity is exhausted, they start generalizing — learning reusable patterns instead of storing individual examples. This transition explains the mysterious double descent phenomenon, in which a model's test performance first gets worse, then better again as dataset size increases, contradicting the classical expectation that more data always helps.
  4. Scaling laws for privacy. They produce equations predicting when membership inference attacks (attempts to determine whether a specific sample was used to train a model, which would reveal private information about the training data) can succeed, showing that modern LLMs are trained on far too much data for such attacks to work on average.
GPT-style models have a hard capacity limit of ~3.6 bits per parameter. Once that fills up, models stop memorizing and start generalizing. This single number predicts double descent, extraction success, and membership inference vulnerability.

Background Concepts

Information Theory Basics

Before we can measure memorization, we need to understand how to measure information itself. Information theory, a branch of mathematics founded by Claude Shannon in 1948 that quantifies information in bits, gives us the tools.

Think of information like surprise. If someone tells you "the sun rose this morning," that's not very informative — you expected it. But if they say "it snowed in Miami in July," that's very informative because it's surprising. Information theory measures exactly how surprising (and therefore informative) each piece of data is, in bits.
Entropy: measuring uncertainty

Entropy, written H(X), is a measure of uncertainty or randomness in data: the higher the entropy, the more unpredictable the data and the more bits needed to represent it. A fair coin has 1 bit of entropy — each flip is completely unpredictable and requires exactly 1 bit to record. A loaded coin that lands heads 99% of the time has very low entropy — it's mostly predictable.

For a dataset X, H(X) tells you the minimum number of bits needed to represent all the data. A dataset of completely random strings has maximum entropy — each string is equally likely, so there's no shortcut to compress them.
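These numbers are easy to check. A minimal sketch (the `entropy` helper is our own, not from the paper):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum p*log2(p), skipping impossible outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(f"fair coin:   {entropy([0.5, 0.5]):.3f} bits")    # 1.000
print(f"loaded coin: {entropy([0.99, 0.01]):.3f} bits")  # ~0.081
print(f"random byte: {entropy([1/256] * 256):.3f} bits") # 8.000
```

The uniform case is the maximum-entropy case: a random byte needs all 8 of its bits, while the loaded coin needs less than a tenth of a bit per flip on average.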

Mutual information: shared knowledge

Mutual information measures how much knowing one variable tells you about another: I(X, Y) = H(X) − H(X | Y), the reduction in uncertainty about X from knowing Y. If knowing a model's parameters lets you compress a datapoint into fewer bits, the mutual information between the model and that datapoint tells you exactly how many bits of "knowledge" the model has about it.

The key equation: I(X, θ̂) = H(X) − H(X | θ̂)

In words: the information a model θ̂ has about dataset X equals the total information in X minus whatever information remains unknown even after looking at the model.
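The same decomposition can be computed by hand for a toy joint distribution — a sketch where Y is a noisy copy of X that matches it 90% of the time (the distribution is our own illustration):

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

px = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (x, yi), p in joint.items() if yi == y) for y in (0, 1)}

H_X = H(px.values())
# Conditional entropy: H(X|Y) = sum_y p(y) * H(X | Y=y)
H_X_given_Y = sum(
    py[y] * H([joint[(x, y)] / py[y] for x in (0, 1)])
    for y in (0, 1)
)
I_XY = H_X - H_X_given_Y
print(f"H(X) = {H_X:.3f}, H(X|Y) = {H_X_given_Y:.3f}, I(X;Y) = {I_XY:.3f} bits")
```

Knowing Y removes about half a bit of uncertainty about X — that half-bit is exactly the mutual information.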

Kolmogorov Complexity

Shannon's entropy works for distributions (random variables), but we need to measure memorization of specific datapoints. This is where Kolmogorov complexity comes in — defined as the length of the shortest computer program that produces a given string. It measures the inherent complexity of a specific piece of data, not of a distribution.

Imagine you need to describe a painting to someone over the phone. A painting of a solid blue square can be described in a few words ("10x10 blue square"). A Jackson Pollock painting requires describing every splatter. Kolmogorov complexity is like the length of that phone call — the shortest possible description for that specific piece of data.
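Kolmogorov complexity is uncomputable exactly, but an off-the-shelf compressor gives the flavor of the phone-call analogy. A sketch using Python's zlib (the byte strings are our own illustration):

```python
import random
import zlib

# A "blue square": highly regular data with a short description.
regular = b"blue " * 2000                                  # 10,000 bytes of repetition
# A "Jackson Pollock": pseudo-random bytes with no pattern (fixed seed).
rng = random.Random(0)
chaotic = bytes(rng.randrange(256) for _ in range(10_000))

print(len(zlib.compress(regular)))   # a few dozen bytes: the pattern is the description
print(len(zlib.compress(chaotic)))   # ~10,000 bytes: no shortcut exists
```

The regular string collapses to a tiny description while the random one barely shrinks at all — the same asymmetry Kolmogorov complexity formalizes.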
Why compression = memorization

Here's the key insight that makes this paper work: if a trained model helps you compress a datapoint into fewer bits than you could without the model, then the model must have memorized something about that datapoint.

How much was memorized? Exactly the difference in compression: mem(x, θ̂) = HK(x) − HK(x | θ̂)

This is the number of bits by which a datapoint x can be shortened when the model θ̂ is available to help compress it. If the model knows nothing about x, it can't help compress it, and memorization is zero.

Arithmetic Coding

Arithmetic coding is the bridge between theory and practice: a compression algorithm that encodes data using the probabilities assigned by a model, turning model predictions into actual compression. If a language model assigns high probability to the next token, arithmetic coding uses fewer bits to represent it. This means the negative log-likelihood of a datapoint under a model, −log p(x), directly estimates how many bits are needed to compress it: lower loss = better predictions = fewer bits = better compression.

Arithmetic coding works like a smart filing system. If you told a librarian "the next book will probably be about cooking" and you're right, filing it is quick — it goes in the cooking section. If you said "cooking" but it's about astrophysics, filing takes longer because the librarian was looking in the wrong place. The model's prediction accuracy directly determines compression efficiency.
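The prediction-to-bits link is a one-line sum. A minimal sketch with hypothetical per-token probabilities (the specific numbers are our own):

```python
import math

def code_length_bits(token_probs):
    """Arithmetic coding achieves roughly -log2 p(x) bits total:
    the sum of each token's negative log-probability under the model."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical probabilities a model assigns to each successive token.
confident_model = [0.9, 0.8, 0.95, 0.9]   # good predictions -> short code
uncertain_model = [0.1, 0.2, 0.05, 0.1]   # poor predictions -> long code

print(f"{code_length_bits(confident_model):.2f} bits")   # under 1 bit total
print(f"{code_length_bits(uncertain_model):.2f} bits")   # over 13 bits
```

A confident model encodes the same four tokens in a small fraction of the bits an uncertain one needs — which is exactly why loss doubles as a compression measurement.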

Membership Inference Attacks

A membership inference attack tries to determine whether a particular data sample was used to train a model — a privacy attack, since success reveals information about the training data. The basic idea is simple: models tend to assign lower loss (higher probability) to data they were trained on. By setting a threshold on the loss, an attacker can guess whether a sample was in the training set.

How loss-based membership inference works

The attack is straightforward: compute the model's loss on a target sample. If the loss is below a threshold, guess "member" (in training set). If above, guess "non-member."

This works because models overfit slightly to training data, giving those samples lower loss. The paper measures attack success using the F1 score, a metric combining precision and recall into a single number between 0 and 1 — a score of 0.5 means the attack is no better than random guessing, while 1.0 means perfect detection.
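The whole attack fits in a few lines. A toy sketch (the per-sample losses are invented for illustration; real attacks operate on actual model losses):

```python
def loss_threshold_attack(losses, threshold):
    """Guess 'member' (True) when a sample's loss falls below the threshold."""
    return [loss < threshold for loss in losses]

def f1_score(predictions, labels):
    """F1 = harmonic mean of precision and recall."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(l and not p for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-sample losses: members (trained-on) skew lower.
member_losses = [2.1, 2.3, 1.9, 2.0, 2.4]
nonmember_losses = [3.0, 2.9, 3.2, 2.2, 3.1]
losses = member_losses + nonmember_losses
labels = [True] * 5 + [False] * 5

preds = loss_threshold_attack(losses, threshold=2.5)
print(f"F1 = {f1_score(preds, labels):.2f}")
```

Here one non-member slips under the threshold, so precision drops below 1.0 while recall stays perfect — a typical trade-off when the two loss distributions overlap.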

How It Works

The paper's central contribution is a practical framework for measuring memorization. Here's the complete architecture of their approach:

  1. Input — Training dataset x: a collection of text sequences (or synthetic bitstrings).
  2. Training — Train target model θ̂: a GPT-style transformer trained from scratch on x.
  3. Core measurement — Compute compression rates: for each sample xi, compute −log p(xi | θ̂) and −log p(xi | θ).
  4. Decomposition — Separate memorization components: total memorization = unintended memorization + generalization.
  5. Output — Per-sample memorization (in bits): an exact measurement of how much θ̂ "knows" about each datapoint.

The Two-Model Trick

The key challenge is distinguishing memorization from generalization. If a model gives a low loss on "2 + 2 = 4", is that because it memorized that specific example, or because it learned arithmetic? The paper solves this with a reference model θ, a model representing the "true" data distribution — for synthetic data, the known random distribution; for real text, a larger model trained on a much bigger superset of the data. Any compression the reference model achieves is attributed to generalization, not memorization.

Total Memorization

How much better does the trained model θ̂ compress xi compared to encoding it from scratch?

mem(x, θ̂) = HK(x) − HK(x | θ̂)

Unintended Memorization

How much better does θ̂ compress xi compared to the reference model θ?

memU(x, θ, θ̂) = HK(x | θ) − HK(x | θ, θ̂)

In practice, the reference model is either the known data-generating distribution (for synthetic data) or a larger "oracle" model trained on a much bigger superset of the data (for real text).

The compression gap between the trained model and the reference model is pure "unintended memorization" — information the model absorbed about specific training samples that goes beyond general language patterns.

From Theory to Practice

While Kolmogorov complexity is uncomputable in theory (you can't find the shortest possible program for arbitrary data), the paper uses arithmetic coding to approximate it. The beauty is that arithmetic coding's code length equals the negative log-likelihood under the model — a quantity we can compute directly.

  1. Feed xi to θ̂ — get the trained model's loss.
  2. Feed xi to θ — get the reference model's loss.
  3. Take the best compression: maximum probability = minimum loss.
  4. Subtract: the reference model's loss minus the minimum loss.
  5. Sum the per-sample bits: Σ over all samples.
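The steps above can be sketched for a single sample — assuming we already have per-token probabilities from both models (the probabilities below are invented for illustration):

```python
import math

def nll_bits(token_probs):
    """Arithmetic-coding code length: -log2 p(x), summed over tokens."""
    return sum(-math.log2(p) for p in token_probs)

def unintended_memorization_bits(probs_trained, probs_reference):
    """memU(x) ~= H(x | reference) - min(H(x | reference), H(x | trained)).
    Positive only when the trained model compresses x better than the reference."""
    h_ref = nll_bits(probs_reference)
    h_trained = nll_bits(probs_trained)
    return h_ref - min(h_ref, h_trained)

# Hypothetical per-token probabilities for one training sample:
# the trained model is far more confident than the reference -> memorization.
p_trained = [0.99, 0.98, 0.99, 0.97]
p_reference = [0.30, 0.25, 0.40, 0.35]

bits = unintended_memorization_bits(p_trained, p_reference)
print(f"unintended memorization: {bits:.2f} bits")
```

Summing this quantity over every training sample gives the dataset-level memorization that the capacity experiments measure.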

Experiments

Synthetic Data: Pure Capacity Measurement

The researchers first eliminate generalization entirely by training on random bitstrings — sequences where each token is uniformly random and independent. With no patterns to learn, every bit the model absorbs is pure memorization.

They trained hundreds of GPT-2 style transformers from 100K to 20M parameters, each on datasets ranging from thousands to millions of random sequences.

Figure: How models fill their bit capacity with memorized data, then hit the 3.6 bits-per-parameter wall.

The results were strikingly clean:

| Model Depth | Dimensions | Parameters | Capacity (bf16) | Bits/Param |
|---|---|---|---|---|
| 1 layer | 128 | 469K | 1.69M bits | 3.61 |
| 2 layers | 128 | 667K | 2.60M bits | 3.89 |
| 4 layers | 128 | 1.06M | 3.75M bits | 3.53 |
| 8 layers | 128 | 1.86M | 6.49M bits | 3.49 |
| 8 layers | 256 | 6.86M | 25.1M bits | 3.65 |
Does precision matter?

The researchers also trained in full fp32 precision (32-bit floating point vs 16-bit bfloat16). Doubling the precision from bf16 to fp32 only increased capacity from 3.51 to 3.83 bits per parameter on average — far less than the 2x you might expect. Most of the extra precision bits aren't used for raw storage.

It's like upgrading from a pocket notebook to a full-size legal pad. You get some extra space, but you don't write twice as many notes — your handwriting stays about the same size, so you only use a fraction of the extra room.

Real Text: Memorization vs. Generalization

Next, the researchers repeated their experiments with real text from the FineWeb dataset, a large-scale, high-quality web text corpus with state-of-the-art deduplication. Unlike random bitstrings, real text has learnable patterns — grammar, common phrases, factual knowledge. Now the model's storage splits between memorization and generalization.

Using a larger "oracle" model as the reference, they observed that when dataset size exceeds model capacity, models are forced to generalize: they can no longer afford to memorize each sample individually, so they find shared patterns to compress information more efficiently.

Double Descent

One of the paper's most compelling findings is a clean explanation for double descent, a phenomenon that has puzzled researchers for years: test loss decreases, then increases (gets worse), then decreases again as training set size or model size grows. It had been observed repeatedly but lacked a clean explanation.

Figure: How test loss behaves as dataset size crosses the model's capacity threshold.

In classical machine learning, the expectation is simple: more data = better performance. But researchers observed that as datasets grow, performance sometimes gets worse before getting better again. The paper shows this happens at a very specific point:

Dataset < Capacity

The model has enough capacity to memorize every sample. Training loss goes to zero. Test loss is decent because memorization helps with similar test samples.

Dataset ≈ Capacity

The danger zone. The model can't memorize everything but hasn't learned to generalize efficiently yet. It's stretched thin, doing neither well. Test loss spikes.

Dataset >> Capacity

The model is forced to find efficient, reusable patterns. It generalizes well and test loss drops below the original level. This is where real learning happens.

The Critical Ratio

Double descent begins exactly when the dataset-to-capacity ratio crosses 1.0. For a model with capacity C bits and a dataset of D bits of information: the transition happens at D/C ≈ 1.

Imagine a student cramming for an exam. With 10 flashcards, they can memorize all 10 perfectly. With 100 flashcards but only room in their head for 50, they're in trouble — they half-remember everything and fully remember nothing. But with 1,000 flashcards, they realize they need to learn the patterns (this type of question always works like this), and suddenly they understand the material at a deeper level.
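The critical ratio is easy to compute for any model. A minimal sketch (the 1M-parameter model is hypothetical, and the regime label is our shorthand for the zones above):

```python
BITS_PER_PARAM = 3.6  # the paper's measured capacity constant

def dataset_capacity_ratio(n_params, dataset_bits):
    """D/C: the paper predicts double descent begins as this ratio crosses ~1.0."""
    return dataset_bits / (BITS_PER_PARAM * n_params)

# A hypothetical 1M-parameter model (capacity ~3.6M bits) at three dataset sizes.
for dataset_bits in (1.0e6, 3.6e6, 3.6e7):
    r = dataset_capacity_ratio(1_000_000, dataset_bits)
    regime = "memorization regime" if r < 1 else "at or past capacity"
    print(f"D/C = {r:5.2f} ({regime})")
```

In the flashcard analogy, D/C is flashcards divided by head-room: below 1 everything fits, near 1 is the danger zone, and well above 1 the student is forced to learn the patterns.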

Membership Inference

The paper's final major contribution is a scaling law for membership inference — a mathematical relationship predicting how attack success changes as model capacity and dataset size scale, letting researchers predict the behavior of larger systems from smaller experiments.

Figure: How membership inference success scales with the ratio of dataset size to model capacity.

The Scaling Law

For a fixed model capacity, membership inference follows a sigmoidal (S-shaped) curve with respect to dataset size: success transitions from near-perfect (F1 ≈ 1.0) for tiny datasets to random guessing (0.5) for very large datasets, with a smooth S-shaped transition in between.

MembershipF1(θ, D) = ½ (1 + c1 · σ(c2 · (|D| / Capacity(θ) + c3)))

Where σ is the sigmoid function and the fitted constants are c1 = 1.34, c2 = −0.034, c3 = −33.14.
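The shape of the fitted curve can be sketched directly from the quoted constants — assuming the sigmoid's argument is the dataset-to-capacity ratio (the exact units of that ratio follow the paper's fit and are not reproduced here), this only illustrates how F1 decays toward 0.5:

```python
import math

C1, C2, C3 = 1.34, -0.034, -33.14  # fitted constants quoted above

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predicted_f1(ratio):
    """Predicted membership-inference F1 as a function of the
    dataset-to-capacity ratio (units as in the paper's fit)."""
    return 0.5 * (1.0 + C1 * sigmoid(C2 * (ratio + C3)))

# F1 falls smoothly toward 0.5 (random guessing) as the ratio grows.
for ratio in (33, 100, 1000):
    print(f"ratio {ratio:4d}: predicted F1 = {predicted_f1(ratio):.3f}")
```

Because c2 is negative, a larger dataset pushes the sigmoid toward zero and the predicted F1 toward the random-guessing floor of 0.5.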

The intuitive reading: when the dataset is small relative to model capacity, every sample leaves a strong trace and the attack succeeds (F1 near 1.0); as the dataset grows far beyond capacity, individual samples leave no detectable trace and F1 falls toward 0.5 — random guessing.

Validation on Larger Models

The researchers validated their scaling law by training GPT-2 Medium (124M params) and GPT-2 XL (1.5B params) on dataset sizes predicted to give specific F1 scores:

| Model | Parameters | Dataset Size | Predicted F1 | Observed F1 |
|---|---|---|---|---|
| GPT-2 XL | 1.56B | 170.7M | 0.55 | 0.546 ± 0.013 |
| GPT-2 XL | 1.56B | 76.8M | 0.75 | 0.711 ± 0.004 |
| GPT-2 XL | 1.56B | 18.9M | 0.95 | 0.959 ± 0.008 |
| GPT-2 Med | 124M | 13.6M | 0.55 | 0.534 ± 0.011 |
| GPT-2 Med | 124M | 6.1M | 0.75 | 0.657 ± 0.006 |
| GPT-2 Med | 124M | 1.5M | 0.95 | 0.980 ± 0.003 |

Predictions were generally within 1–2 percentage points of actual values. The largest discrepancy was at the predicted F1 of 0.75, where the sigmoid is steepest (small changes in the ratio produce large changes in F1).

Implications for Modern LLMs

Contemporary language models are trained at a tokens-per-parameter ratio of 100× or more; Llama 3's 8B-parameter model, for example, was trained on 15 trillion tokens — nearly 1,900 tokens per parameter. According to this scaling law, that puts their predicted membership inference F1 at essentially 0.5 — random guessing.

This paper provides formal evidence for why membership inference attacks fail on modern LLMs: the ratio of data to model capacity is so enormous that individual training samples leave no detectable trace.

Results

Key Findings Summary

| Finding | Value | Significance |
|---|---|---|
| Bits-per-parameter (bf16) | 3.51 ± 0.1 (avg) / ~3.6 (large models) | Universal capacity constant for GPT-style models |
| Bits-per-parameter (fp32) | 3.83 ± 0.1 | Only a 9% increase despite 2× precision |
| Double descent onset | Dataset/Capacity ≈ 1 | First precise prediction of the transition point |
| Scaling law accuracy | ±1.5% F1 | Predictions match observations on 124M–1.5B param models |
| Modern LLM MI vulnerability | F1 ≈ 0.5 | Attacks are no better than random guessing |

Extraction vs. Membership Inference

An interesting finding: membership inference is strictly easier than data extraction. In some cases, membership inference achieved F1 of 0.97 while the extraction rate for the same model-dataset pair was 0.0. This makes sense — detecting a statistical trace of training data is easier than reproducing it verbatim.

For very small training sets with 32-token prefixes, 100% of sequences were extractable. But as datasets grow, extraction converges to the test set extraction rate — meaning all remaining "extraction" is just generalization (the model can produce the text because it learned the underlying patterns, not because it memorized that specific sample).

Deduplication Matters

The researchers found that careful deduplication — removing duplicate or near-duplicate entries from the dataset — was "extremely important for faithfully measuring extraction rates," since duplicated data is artificially easier to extract. When sequences are truncated to 64 tokens, 1–2% become duplicates — enough to significantly bias results. They performed an additional deduplication step on top of FineWeb's existing deduplication.
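Why truncation creates duplicates is easy to see in code. A toy sketch of prefix-based dedup (this is our own illustration, not the paper's pipeline):

```python
import hashlib

def dedup_by_prefix(sequences, prefix_tokens=64):
    """Keep only the first occurrence of each prefix; truncation can
    create duplicates that full-length comparison would miss."""
    seen, unique = set(), []
    for seq in sequences:
        key = hashlib.sha256(" ".join(seq[:prefix_tokens]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique

# Two documents that differ only after token 64 collide once truncated.
doc_a = ["tok"] * 64 + ["ending-one"]
doc_b = ["tok"] * 64 + ["ending-two"]

print(len(dedup_by_prefix([doc_a, doc_b, doc_a])))  # 1: all share one 64-token prefix
```

Documents that are distinct at full length become identical once clipped to 64 tokens — exactly the bias the extra deduplication step guards against.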

Final Quiz

  1. A GPT-style model with 1 billion parameters trained in bfloat16 can memorize approximately how much data?
  2. What is the purpose of the reference model (θ) in this paper's framework?
  3. When does double descent begin, according to this paper?
  4. Why are modern LLMs (like Llama 3) essentially immune to membership inference attacks?
  5. How does this paper measure memorization?
  6. Doubling the numerical precision from bfloat16 to float32 does what to model capacity?

Why This Paper Matters

For Practitioners and Builders

If you're building systems with language models, this paper gives you a practical rule of thumb: a model can store about 3.6 bits per parameter. That's roughly 0.45 bytes per parameter, or about 430 MB for a 1B-parameter model.
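The rule of thumb as a one-liner (the model sizes below are illustrative):

```python
BITS_PER_PARAM = 3.6  # the paper's capacity constant for GPT-style models

def capacity_mb(n_params):
    """Approximate raw storage capacity: bits -> bytes -> mebibytes."""
    return BITS_PER_PARAM * n_params / 8 / (1024 ** 2)

for params in (124e6, 1e9, 8e9):
    print(f"{params/1e9:5.3f}B params -> ~{capacity_mb(params):,.0f} MB")
```

A 1B-parameter model lands at roughly 430 MB of raw memorization capacity, matching the figure quoted above.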

For the Research Community

This paper makes several contributions that change how we think about language models: a compression-based definition of memorization, a measured capacity constant of ~3.6 bits per parameter, a capacity-based explanation of double descent, and predictive scaling laws for membership inference.

The Bigger Picture

This paper sits at the intersection of information theory, privacy, and deep learning scaling laws — three fields that are converging as AI systems grow. The key insight — that neural networks have a measurable, finite capacity for storing information — has implications far beyond this paper.