How Much Do Language Models Memorize? The Paper, Explained
A beginner-friendly guide to measuring what language models actually remember. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
When you train a language model (a program that learns to predict the next word in a sequence, trained on massive amounts of text; GPT, Llama, and Claude are all language models) on billions of sentences, does it actually remember specific training examples? Or does it just learn general patterns? This paper proposes a rigorous way to answer that question — and the answer reveals a fundamental constant of how neural networks store information.
Here's what the researchers found:
- A new way to measure memorization. They define memorization in terms of compression (representing data in fewer bits): how many fewer bits does it take to describe a training example when you have the model available? If the model helps you compress a specific datapoint more, it has memorized more about that datapoint. This cleanly separates what the model memorized from what it learned generally.
- A universal capacity constant. GPT-style transformer models (the transformer is the neural network architecture behind most modern language models; its attention mechanism lets every word consider every other word) can store approximately 3.6 bits of information per parameter. A 1-billion-parameter model can memorize about 3.6 billion bits (roughly 430 megabytes) of data.
- Capacity fills, then generalization begins. Models first memorize training data until their capacity is full. Once capacity is exhausted, they start generalizing — learning reusable patterns instead of storing individual examples. This transition explains the mysterious double descent phenomenon, where test performance first gets worse, then better again as dataset size increases, contradicting the classical expectation that more data always helps.
- Scaling laws for privacy. They produce equations predicting when membership inference attacks (attacks that try to determine whether a specific sample was used to train a model, which would reveal private information about the training data) can succeed, showing that modern LLMs are trained on far too much data for such attacks to work on average.
Background Concepts
Information Theory Basics
Before we can measure memorization, we need to understand how to measure information itself. Information theory, the branch of mathematics founded by Claude Shannon in 1948 that quantifies information in bits, gives us the tools.
Entropy: measuring uncertainty
Entropy, denoted H(X), measures how unpredictable a data source is; higher entropy means more unpredictable data, requiring more bits to represent. A fair coin has 1 bit of entropy — each flip is completely unpredictable and requires exactly 1 bit to record. A loaded coin that lands heads 99% of the time has very low entropy — it's mostly predictable.
For a dataset X, H(X) tells you the minimum number of bits needed to represent all the data. A dataset of completely random strings has maximum entropy — each string is equally likely, so there's no shortcut to compress them.
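The coin examples above can be checked directly. A minimal sketch in Python:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # 1.0 bit: completely unpredictable
loaded_coin = entropy([0.99, 0.01])  # ~0.08 bits: mostly predictable
```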
Mutual information: shared knowledge
Mutual information, I(X, Y) = H(X) − H(X | Y), measures how much knowing one variable tells you about another: how much uncertainty about X is reduced by knowing Y. If knowing a model's parameters lets you compress a datapoint into fewer bits, the mutual information between the model and that datapoint tells you exactly how many bits of "knowledge" the model has about it.
The key equation: I(X, θ̂) = H(X) − H(X | θ̂)
In words: the information a model θ̂ has about dataset X equals the total information in X minus whatever information remains unknown even after looking at the model.
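A toy illustration of that equation, with made-up numbers rather than the paper's setup: let X be a fair bit and Y a noisy copy of X that disagrees 10% of the time.

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

flip = 0.1                          # Y disagrees with X 10% of the time
H_X = H([0.5, 0.5])                 # 1 bit of uncertainty about X
H_X_given_Y = H([1 - flip, flip])   # uncertainty that remains after seeing Y
I_XY = H_X - H_X_given_Y            # ~0.53 bits: what Y tells us about X
```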
Kolmogorov Complexity
Shannon's entropy works for distributions (random variables), but we need to measure memorization of specific datapoints. This is where Kolmogorov complexity comes in: defined as the length of the shortest computer program that produces a given string, it measures the inherent complexity of a single string, not a distribution.
Why compression = memorization
Here's the key insight that makes this paper work: if a trained model helps you compress a datapoint into fewer bits than you could without the model, then the model must have memorized something about that datapoint.
How much was memorized? Exactly the difference in compression: mem(x, θ̂) = H_K(x) − H_K(x | θ̂)
This is the number of bits by which a datapoint x can be shortened when the model θ̂ is available to help compress it. If the model knows nothing about x, it can't help compress it, and memorization is zero.
Arithmetic Coding
Arithmetic coding is the bridge between theory and practice: a compression algorithm that encodes data using the probabilities assigned by a model, so the better the model's predictions, the shorter the encoded message. If a language model assigns high probability to the next token, arithmetic coding uses fewer bits to represent it. The code length equals the negative log-likelihood under the model, −log p(x), so a datapoint's loss directly estimates how many bits are needed to compress it: lower loss = better predictions = fewer bits = better compression.
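A sketch of the bits-from-probabilities correspondence, using hypothetical per-token probabilities (the arithmetic coder itself is omitted; only the code-length accounting is shown):

```python
import math

# Hypothetical probabilities a language model assigns to each token of a sentence.
token_probs = [0.9, 0.5, 0.25, 0.8]

# Arithmetic coding spends about -log2(p) bits per token, so the total code
# length equals the sequence's negative log2-likelihood under the model.
total_bits = sum(-math.log2(p) for p in token_probs)
# Confident predictions (p near 1) cost almost nothing; surprises cost a lot.
```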
Membership Inference Attacks
A membership inference attack tries to determine whether a particular data sample was used to train a model. The basic idea is simple: models tend to assign lower loss (higher probability) to data they were trained on. By setting a threshold on the loss, an attacker can guess whether a sample was in the training set.
How loss-based membership inference works
The attack is straightforward: compute the model's loss on a target sample. If the loss is below a threshold, guess "member" (in training set). If above, guess "non-member."
This works because models overfit slightly to training data, giving those samples lower loss. The paper measures attack success using F1 score, a metric combining precision and recall into a single number between 0 and 1: a score of 0.5 means the attack is no better than random guessing, while 1.0 means perfect detection.
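A minimal sketch of the loss-threshold attack and its F1 score, on made-up loss values:

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical losses: members (training data) tend to score lower.
member_losses = [1.2, 1.5, 1.9, 2.4]
non_member_losses = [2.1, 2.6, 3.0, 3.3]

threshold = 2.0  # guess "member" whenever the loss falls below this
tp = sum(l < threshold for l in member_losses)      # members correctly flagged
fp = sum(l < threshold for l in non_member_losses)  # non-members wrongly flagged
fn = sum(l >= threshold for l in member_losses)     # members missed

attack_f1 = f1_score(tp, fp, fn)  # ~0.86 here: well above random guessing
```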
How It Works
The paper's central contribution is a practical framework for measuring memorization. Here's the complete architecture of their approach:
The Two-Model Trick
The key challenge is distinguishing memorization from generalization. If a model gives a low loss on "2 + 2 = 4", is that because it memorized that specific example, or because it learned arithmetic? The paper solves this with a reference model: a model θ representing the "true" data distribution. For synthetic data, it's the known random distribution; for real text, it's a larger model trained on a much bigger superset of the data. Any compression the reference model achieves is attributed to generalization, not memorization.
Total Memorization
How much better does the trained model θ̂ compress x_i compared to encoding it from scratch?
Unintended Memorization
How much better does θ̂ compress x_i compared to the reference model θ?
In practice, the reference model is either:
- For synthetic data: the known uniform distribution (since random bitstrings have no patterns to generalize)
- For real text: a larger model of the same architecture, trained on a much larger superset of the data. Anything this "oracle" model can also predict is generalization, not memorization.
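The two measurements above can be sketched as differences in code length (bits = negative log2-likelihood), here with hypothetical per-token probabilities for a single sequence x:

```python
import math

def bits(probs):
    """Code length in bits implied by per-token probabilities."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical probabilities for the same 4-token sequence under three codes:
baseline = [1 / 50257] * 4            # uniform over a GPT-2-sized vocabulary
reference = [0.10, 0.20, 0.05, 0.30]  # oracle: knows general language patterns
trained = [0.60, 0.70, 0.40, 0.80]    # trained model: has also seen this example

total_mem = bits(baseline) - bits(trained)        # everything the model stores about x
unintended_mem = bits(reference) - bits(trained)  # what remains after crediting generalization
```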
From Theory to Practice
While Kolmogorov complexity is uncomputable in theory (you can't find the shortest possible program for arbitrary data), the paper uses arithmetic coding to approximate it. The beauty is that arithmetic coding's code length equals the negative log-likelihood under the model — a quantity we can compute directly.
Experiments
Synthetic Data: Pure Capacity Measurement
The researchers first eliminate generalization entirely by training on random bitstrings — sequences where each token is uniformly random and independent. With no patterns to learn, every bit the model absorbs is pure memorization.
They trained hundreds of GPT-2 style transformers from 100K to 20M parameters, each on datasets ranging from thousands to millions of random sequences.
The results were strikingly clean:
- Small datasets are completely memorized by models with enough capacity
- As datasets grow, memorization hits a hard plateau regardless of how much more data you provide
- This plateau scales linearly with parameter count at approximately 3.6 bits per parameter (in bfloat16 precision)
| Model Depth | Dimensions | Parameters | Capacity (bf16) | Bits/Param |
|---|---|---|---|---|
| 1 layer | 128 | 469K | 1.69M bits | 3.61 |
| 2 layers | 128 | 667K | 2.60M bits | 3.89 |
| 4 layers | 128 | 1.06M | 3.75M bits | 3.53 |
| 8 layers | 128 | 1.86M | 6.49M bits | 3.49 |
| 8 layers | 256 | 6.86M | 25.1M bits | 3.65 |
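As a sanity check, the bits-per-parameter constant can be re-derived from the table's own numbers with a least-squares fit through the origin (capacity ≈ α · params):

```python
# Parameter counts and measured capacities (in bits) copied from the table above.
params = [469e3, 667e3, 1.06e6, 1.86e6, 6.86e6]
capacities = [1.69e6, 2.60e6, 3.75e6, 6.49e6, 25.1e6]

# Through-the-origin least squares: alpha = sum(p*c) / sum(p^2).
alpha = sum(p * c for p, c in zip(params, capacities)) / sum(p * p for p in params)
# alpha lands near the paper's ~3.6 bits per parameter
```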
Does precision matter?
The researchers also trained in full fp32 precision (32-bit floating point vs 16-bit bfloat16). Doubling the precision from bf16 to fp32 only increased capacity from 3.51 to 3.83 bits per parameter on average — far less than the 2x you might expect. Most of the extra precision bits aren't used for raw storage.
Real Text: Memorization vs. Generalization
Next, the researchers repeated their experiments with real text from the FineWeb dataset, a large-scale, high-quality web text corpus with state-of-the-art deduplication. Unlike random bitstrings, real text has learnable patterns — grammar, common phrases, factual knowledge. Now the model's storage splits between memorization and generalization.
Using a larger "oracle" model as the reference, they observed:
- Small datasets: The trained model memorizes extensively, achieving better compression than even the oracle (because the oracle hasn't seen those specific examples)
- Medium datasets: Unintended memorization peaks, then declines as the model starts substituting generalization for raw memorization
- Large datasets: Most of the model's knowledge becomes general language understanding, and per-sample memorization drops to near zero
Double Descent
One of the paper's most compelling findings is a clean explanation for double descent — a phenomenon that has puzzled researchers for years, in which test loss decreases, then increases (gets worse), then decreases again as training set size or model size grows.
In classical machine learning, the expectation is simple: more data = better performance. But researchers observed that as datasets grow, performance sometimes gets worse before getting better again. The paper shows this happens at a very specific point:
Dataset < Capacity
The model has enough capacity to memorize every sample. Training loss goes to zero. Test loss is decent because memorization helps with similar test samples.
Dataset ≈ Capacity
The danger zone. The model can't memorize everything but hasn't learned to generalize efficiently yet. It's stretched thin, doing neither well. Test loss spikes.
Dataset >> Capacity
The model is forced to find efficient, reusable patterns. It generalizes well and test loss drops below the original level. This is where real learning happens.
The Critical Ratio
Double descent begins exactly when the dataset-to-capacity ratio crosses 1.0. For a model with capacity C bits and a dataset of D bits of information: the transition happens at D/C ≈ 1.
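A sketch of locating the transition, assuming the ~3.6 bits/parameter capacity constant reported earlier:

```python
BITS_PER_PARAM = 3.6  # capacity constant reported by the paper (bf16 models)

def dataset_to_capacity_ratio(n_params, dataset_bits):
    """D/C: double descent is predicted to begin where this crosses 1.0."""
    return dataset_bits / (BITS_PER_PARAM * n_params)

# A 1M-parameter model holds ~3.6M bits, so a dataset carrying ~3.6M bits of
# information sits exactly in the danger zone.
ratio = dataset_to_capacity_ratio(1_000_000, 3_600_000)  # = 1.0
```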
Membership Inference
The paper's final major contribution is a scaling law for membership inference: an equation that predicts exactly when privacy attacks will succeed based on model capacity and dataset size. Like other scaling laws, it lets researchers predict the behavior of larger systems from smaller experiments.
The Scaling Law
For a fixed model capacity, membership inference follows a sigmoidal (S-shaped) curve with respect to dataset size: attack success transitions from near-perfect (F1 = 1.0) for tiny datasets to random guessing (F1 = 0.5) for very large datasets, with a smooth transition in between.
The fitted curve applies the sigmoid function σ to the capacity-to-dataset-size ratio, with fitted constants c1 = 1.34, c2 = −0.034, and c3 = −33.14.
The intuitive reading:
- Large model, tiny dataset → Capacity/|D| is huge → F1 near 1.0 (attack easily succeeds)
- Small model, huge dataset → Capacity/|D| is tiny → F1 near 0.5 (attack is random guessing)
- The transition between these regimes is smooth and predictable
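A sketch of that S-shaped relationship. The shape parameters a and b below are illustrative placeholders, not the paper's fitted constants; only the endpoints (0.5 and 1.0) come from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f1_sketch(capacity_to_data_ratio, a=2.0, b=0.0):
    """Illustrative curve from F1=0.5 (random) up to F1=1.0 (perfect).
    a and b are hypothetical shape parameters, not the fitted c1, c2, c3."""
    return 0.5 + 0.5 * sigmoid(a * math.log10(capacity_to_data_ratio) + b)

low = f1_sketch(1e-3)   # small model, huge dataset: near 0.5
high = f1_sketch(1e3)   # large model, tiny dataset: near 1.0
```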
Validation on Larger Models
The researchers validated their scaling law by training GPT-2 Medium (124M params) and GPT-2 XL (1.5B params) on dataset sizes predicted to give specific F1 scores:
| Model | Parameters | Dataset Size | Predicted F1 | Observed F1 |
|---|---|---|---|---|
| GPT-2 XL | 1.56B | 170.7M | 0.55 | 0.546 ± 0.013 |
| GPT-2 XL | 1.56B | 76.8M | 0.75 | 0.711 ± 0.004 |
| GPT-2 XL | 1.56B | 18.9M | 0.95 | 0.959 ± 0.008 |
| GPT-2 Med | 124M | 13.6M | 0.55 | 0.534 ± 0.011 |
| GPT-2 Med | 124M | 6.1M | 0.75 | 0.657 ± 0.006 |
| GPT-2 Med | 124M | 1.5M | 0.95 | 0.980 ± 0.003 |
Predictions were generally within 1–2 percentage points of actual values. The largest discrepancy was at the predicted F1 of 0.75, where the sigmoid is steepest (small changes in the ratio produce large changes in F1).
Implications for Modern LLMs
Contemporary language models are trained with a tokens-per-parameter ratio of 100x or more (e.g., Llama 3 with 8B parameters was trained on 15 trillion tokens). According to this scaling law, that puts their predicted membership inference F1 at essentially 0.5 — random guessing.
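Plugging in the Llama 3 figures quoted above:

```python
tokens = 15e12  # 15 trillion training tokens
params = 8e9    # 8 billion parameters
tokens_per_param = tokens / params  # = 1875, far beyond the ~100x threshold
```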
Results
Key Findings Summary
| Finding | Value | Significance |
|---|---|---|
| Bits-per-parameter (bf16) | 3.51 ± 0.1 (avg) / ~3.6 (large models) | Universal capacity constant for GPT-style models |
| Bits-per-parameter (fp32) | 3.83 ± 0.1 | Only 9% increase despite 2x precision |
| Double descent onset | Dataset/Capacity ≈ 1 | First precise prediction of the transition point |
| Scaling law accuracy | ± 1.5% F1 | Predictions match observations on 125M–1.5B param models |
| Modern LLM MI vulnerability | F1 ≈ 0.5 | Attacks are no better than random guessing |
Extraction vs. Membership Inference
An interesting finding: membership inference is strictly easier than data extraction. In some cases, membership inference achieved F1 of 0.97 while the extraction rate for the same model-dataset pair was 0.0. This makes sense — detecting a statistical trace of training data is easier than reproducing it verbatim.
For very small training sets with 32-token prefixes, 100% of sequences were extractable. But as datasets grow, extraction converges to the test set extraction rate — meaning all remaining "extraction" is just generalization (the model can produce the text because it learned the underlying patterns, not because it memorized that specific sample).
Deduplication Matters
The researchers found that careful deduplication (removing duplicate or near-duplicate entries from a dataset) was "extremely important for faithfully measuring extraction rates," since duplicated data is artificially easier to extract. When sequences are truncated to 64 tokens, 1–2% become duplicates — enough to significantly bias results. They performed an additional deduplication step on top of FineWeb's existing deduplication.
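A minimal sketch of the kind of exact-duplicate filtering described here, using truncated prefixes as keys (the paper's actual pipeline is more involved):

```python
def dedupe(sequences, window=64):
    """Keep only the first occurrence of each truncated sequence."""
    seen, kept = set(), []
    for seq in sequences:
        key = tuple(seq[:window])  # sequences identical up to `window` tokens collide
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept

docs = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 9], [7, 8, 9, 10, 11]]
unique = dedupe(docs, window=4)  # the first two collide on their 4-token prefix
```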
Why This Paper Matters
For Practitioners and Builders
If you're building systems with language models, this paper gives you a practical rule of thumb: a model can store about 3.6 bits per parameter. That's roughly 0.45 bytes per parameter, or about 430 MB for a 1B-parameter model. This means:
- Fine-tuning budget: If you're fine-tuning a model on proprietary data, you now know roughly how much it can absorb. A 7B model can memorize about 3 GB of information — but in practice, you want the model to generalize, not memorize, so your dataset should be significantly larger than this capacity.
- Privacy engineering: The scaling law provides a quantitative framework for assessing whether your training data is at risk. If your data-to-capacity ratio is above ~100x, membership inference is essentially impossible on the average sample.
- Model sizing: If you need a model that memorizes specific knowledge (like a knowledge base), this paper tells you exactly how many parameters you need per bit of knowledge.
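The rule of thumb above, as a quick calculator (assumes the ~3.6 bits/parameter constant):

```python
BITS_PER_PARAM = 3.6  # the paper's capacity constant

def capacity_bytes(n_params):
    """Approximate raw memorization capacity in bytes."""
    return n_params * BITS_PER_PARAM / 8

one_b = capacity_bytes(1e9) / 1e9    # ~0.45 GB for a 1B model
seven_b = capacity_bytes(7e9) / 1e9  # ~3.15 GB for a 7B model
```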
For the Research Community
This paper makes several contributions that change how we think about language models:
- A rigorous definition of memorization. Previous definitions (can the model generate it? can an attacker extract it?) conflated memorization with generalization. This paper's compression-based definition cleanly separates the two using Kolmogorov complexity.
- A clean explanation of double descent. Instead of hand-waving about "interpolation thresholds," the paper shows double descent begins at a precise, measurable point: when dataset information exceeds model capacity in bits.
- Quantitative privacy analysis. The scaling law moves privacy discussions from qualitative ("is it possible to extract data?") to quantitative ("the F1 score at this model/data ratio is exactly X").
The Bigger Picture
This paper sits at the intersection of information theory, privacy, and deep learning scaling laws — three fields that are converging as AI systems grow. The key insight — that neural networks have a measurable, finite capacity for storing information — has implications far beyond this paper:
- Scaling laws meet information theory. The ~3.6 bits-per-parameter constant suggests there's a fundamental limit to how efficiently gradient-descent-trained networks can use their parameters for storage. Future architectures might push this limit higher.
- The memorization-generalization spectrum. This paper reframes the classic machine learning debate as a capacity allocation problem: models have finite bits, and they choose (via gradient descent) how to spend them between memorizing specifics and learning patterns.
- Data governance at scale. As regulations like GDPR require understanding what models "know" about individuals, having a rigorous, quantitative measure of per-sample memorization becomes legally and ethically important.