How Much Do Language Models Memorize? The Paper, Explained
A beginner-friendly guide to measuring what language models actually remember. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
When you train a language model (a program that learns to predict the next word in a sequence, trained on massive amounts of text; GPT, Llama, and Claude are all language models) on billions of sentences, does it actually remember specific training examples? Or does it just learn general patterns? This paper proposes a rigorous way to answer that question — and the answer reveals a fundamental constant of how neural networks store information.
Here's what the researchers found:
- A new way to measure memorization. They define memorization in terms of compression (representing data in fewer bits): how many fewer bits does it take to describe a training example when you have the model available? If the model helps you compress a specific datapoint more, it has memorized more about that datapoint. This cleanly separates what the model memorized from what it learned generally.
- A universal capacity constant. GPT-style transformer models (the transformer is the neural network architecture behind most modern language models; its attention mechanism lets every word consider every other word) can store approximately 3.6 bits of information per parameter. A 1-billion-parameter model can memorize about 3.6 billion bits (roughly 430 megabytes) of data.
- Capacity fills, then generalization begins. Models first memorize training data until their capacity is full. Once capacity is exhausted, they start generalizing — learning reusable patterns instead of storing individual examples. This transition explains the mysterious double descent phenomenon, where test performance first gets worse, then better again as dataset size increases, contradicting the classical expectation that more data always helps.
- Scaling laws for privacy. They produce equations predicting when membership inference attacks (attacks that try to determine whether a specific sample was used to train a model, which would reveal private information about the training data) can succeed, showing that modern LLMs are trained on far too much data for such attacks to work on average.
Background Concepts
Information Theory Basics
Before we can measure memorization, we need to understand how to measure information itself. Information theory, the branch of mathematics founded by Claude Shannon in 1948 that quantifies information in bits, gives us the tools.
Entropy: measuring uncertainty
Entropy, denoted H(X), measures how unpredictable a data source is; higher entropy means more unpredictable data, requiring more bits to represent. A fair coin has 1 bit of entropy — each flip is completely unpredictable and requires exactly 1 bit to record. A loaded coin that lands heads 99% of the time has very low entropy — it's mostly predictable.
For a dataset X, H(X) tells you the minimum number of bits needed to represent all the data. A dataset of completely random strings has maximum entropy — each string is equally likely, so there's no shortcut to compress them.
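The coin examples above can be checked directly. A minimal sketch in Python:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # 1.0 bit: completely unpredictable
loaded_coin = entropy([0.99, 0.01])  # ~0.08 bits: mostly predictable
```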
Mutual information: shared knowledge
Mutual information, I(X, Y) = H(X) − H(X | Y), measures how much knowing one variable tells you about another: how much uncertainty about X is reduced by knowing Y. If knowing a model's parameters lets you compress a datapoint into fewer bits, the mutual information between the model and that datapoint tells you exactly how many bits of "knowledge" the model has about it.
The key equation: I(X, θ̂) = H(X) − H(X | θ̂)
In words: the information a model θ̂ has about dataset X equals the total information in X minus whatever information remains unknown even after looking at the model.
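A toy illustration of that equation, with made-up numbers rather than the paper's setup: let X be a fair bit and Y a noisy copy of X that disagrees 10% of the time.

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

flip = 0.1                          # Y disagrees with X 10% of the time
H_X = H([0.5, 0.5])                 # 1 bit of uncertainty about X
H_X_given_Y = H([1 - flip, flip])   # uncertainty that remains after seeing Y
I_XY = H_X - H_X_given_Y            # ~0.53 bits: what Y tells us about X
```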
Kolmogorov Complexity
Shannon's entropy works for distributions (random variables), but we need to measure memorization of specific datapoints. This is where Kolmogorov complexity comes in: defined as the length of the shortest computer program that produces a given string, it measures the inherent complexity of a single string, not a distribution.
Why compression = memorization
Here's the key insight that makes this paper work: if a trained model helps you compress a datapoint into fewer bits than you could without the model, then the model must have memorized something about that datapoint.
How much was memorized? Exactly the difference in compression: mem(x, θ̂) = H_K(x) − H_K(x | θ̂)
This is the number of bits by which a datapoint x can be shortened when the model θ̂ is available to help compress it. If the model knows nothing about x, it can't help compress it, and memorization is zero.
Arithmetic Coding
Arithmetic coding is the bridge between theory and practice: a compression algorithm that encodes data using the probabilities assigned by a model, so the better the model's predictions, the shorter the encoded message. If a language model assigns high probability to the next token, arithmetic coding uses fewer bits to represent it. The code length equals the negative log-likelihood under the model, −log p(x), so a datapoint's loss directly estimates how many bits are needed to compress it: lower loss = better predictions = fewer bits = better compression.
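A sketch of the bits-from-probabilities correspondence, using hypothetical per-token probabilities (the arithmetic coder itself is omitted; only the code-length accounting is shown):

```python
import math

# Hypothetical probabilities a language model assigns to each token of a sentence.
token_probs = [0.9, 0.5, 0.25, 0.8]

# Arithmetic coding spends about -log2(p) bits per token, so the total code
# length equals the sequence's negative log2-likelihood under the model.
total_bits = sum(-math.log2(p) for p in token_probs)
# Confident predictions (p near 1) cost almost nothing; surprises cost a lot.
```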
Membership Inference Attacks
A membership inference attack tries to determine whether a particular data sample was used to train a model. The basic idea is simple: models tend to assign lower loss (higher probability) to data they were trained on. By setting a threshold on the loss, an attacker can guess whether a sample was in the training set.
How loss-based membership inference works
The attack is straightforward: compute the model's loss on a target sample. If the loss is below a threshold, guess "member" (in training set). If above, guess "non-member."
This works because models overfit slightly to training data, giving those samples lower loss. The paper measures attack success using F1 score, a metric combining precision and recall into a single number between 0 and 1: a score of 0.5 means the attack is no better than random guessing, while 1.0 means perfect detection.
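A minimal sketch of the loss-threshold attack and its F1 score, on made-up loss values:

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical losses: members (training data) tend to score lower.
member_losses = [1.2, 1.5, 1.9, 2.4]
non_member_losses = [2.1, 2.6, 3.0, 3.3]

threshold = 2.0  # guess "member" whenever the loss falls below this
tp = sum(l < threshold for l in member_losses)      # members correctly flagged
fp = sum(l < threshold for l in non_member_losses)  # non-members wrongly flagged
fn = sum(l >= threshold for l in member_losses)     # members missed

attack_f1 = f1_score(tp, fp, fn)  # ~0.86 here: well above random guessing
```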
How It Works
The paper's central contribution is a practical framework for measuring memorization. Here's the complete architecture of their approach:
The Two-Model Trick
The key challenge is distinguishing memorization from generalization. If a model gives a low loss on "2 + 2 = 4", is that because it memorized that specific example, or because it learned arithmetic? The paper solves this with a reference model: a model θ representing the "true" data distribution. For synthetic data, it's the known random distribution; for real text, it's a larger model trained on a much bigger superset of the data. Any compression the reference model achieves is attributed to generalization, not memorization.
Total Memorization
How much better does the trained model θ̂ compress x_i compared to encoding it from scratch?
Unintended Memorization
How much better does θ̂ compress x_i compared to the reference model θ?
In practice, the reference model is either:
- For synthetic data: the known uniform distribution (since random bitstrings have no patterns to generalize)
- For real text: a larger model of the same architecture, trained on a much larger superset of the data. Anything this "oracle" model can also predict is generalization, not memorization.
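The two measurements above can be sketched as differences in code length (bits = negative log2-likelihood), here with hypothetical per-token probabilities for a single sequence x:

```python
import math

def bits(probs):
    """Code length in bits implied by per-token probabilities."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical probabilities for the same 4-token sequence under three codes:
baseline = [1 / 50257] * 4            # uniform over a GPT-2-sized vocabulary
reference = [0.10, 0.20, 0.05, 0.30]  # oracle: knows general language patterns
trained = [0.60, 0.70, 0.40, 0.80]    # trained model: has also seen this example

total_mem = bits(baseline) - bits(trained)        # everything the model stores about x
unintended_mem = bits(reference) - bits(trained)  # what remains after crediting generalization
```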
From Theory to Practice
While Kolmogorov complexity is uncomputable in theory (you can't find the shortest possible program for arbitrary data), the paper uses arithmetic coding to approximate it. The beauty is that arithmetic coding's code length equals the negative log-likelihood under the model — a quantity we can compute directly.
Experiments
Synthetic Data: Pure Capacity Measurement
The researchers first eliminate generalization entirely by training on random bitstrings — sequences where each token is uniformly random and independent. With no patterns to learn, every bit the model absorbs is pure memorization.
They trained hundreds of GPT-2 style transformers from 100K to 20M parameters, each on datasets ranging from thousands to millions of random sequences.
The results were strikingly clean:
- Small datasets are completely memorized by models with enough capacity
- As datasets grow, memorization hits a hard plateau regardless of how much more data you provide
- This plateau scales linearly with parameter count at approximately 3.6 bits per parameter (in bfloat16 precision)
| Model Depth | Dimensions | Parameters | Capacity (bf16) | Bits/Param |
|---|---|---|---|---|
| 1 layer | 128 | 469K | 1.69M bits | 3.61 |
| 2 layers | 128 | 667K | 2.60M bits | 3.89 |
| 4 layers | 128 | 1.06M | 3.75M bits | 3.53 |
| 8 layers | 128 | 1.86M | 6.49M bits | 3.49 |
| 8 layers | 256 | 6.86M | 25.1M bits | 3.65 |
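As a sanity check, the bits-per-parameter constant can be re-derived from the table's own numbers with a least-squares fit through the origin (capacity ≈ α · params):

```python
# Parameter counts and measured capacities (in bits) copied from the table above.
params = [469e3, 667e3, 1.06e6, 1.86e6, 6.86e6]
capacities = [1.69e6, 2.60e6, 3.75e6, 6.49e6, 25.1e6]

# Through-the-origin least squares: alpha = sum(p*c) / sum(p^2).
alpha = sum(p * c for p, c in zip(params, capacities)) / sum(p * p for p in params)
# alpha lands near the paper's ~3.6 bits per parameter
```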
Does precision matter?
The researchers also trained in full fp32 precision (32-bit floating point vs 16-bit bfloat16). Doubling the precision from bf16 to fp32 only increased capacity from 3.51 to 3.83 bits per parameter on average — far less than the 2x you might expect. Most of the extra precision bits aren't used for raw storage.
Real Text: Memorization vs. Generalization
Next, the researchers repeated their experiments with real text from the FineWeb dataset, a large-scale, high-quality web text corpus with state-of-the-art deduplication. Unlike random bitstrings, real text has learnable patterns — grammar, common phrases, factual knowledge. Now the model's storage splits between memorization and generalization.
Using a larger "oracle" model as the reference, they observed:
- Small datasets: The trained model memorizes extensively, achieving better compression than even the oracle (because the oracle hasn't seen those specific examples)
- Medium datasets: Unintended memorization peaks, then declines as the model starts substituting generalization for raw memorization
- Large datasets: Most of the model's knowledge becomes general language understanding, and per-sample memorization drops to near zero
Double Descent
One of the paper's most compelling findings is a clean explanation for double descent — a phenomenon that has puzzled researchers for years, in which test loss decreases, then increases (gets worse), then decreases again as training set size or model size grows.
In classical machine learning, the expectation is simple: more data = better performance. But researchers observed that as datasets grow, performance sometimes gets worse before getting better again. The paper shows this happens at a very specific point:
Dataset < Capacity
The model has enough capacity to memorize every sample. Training loss goes to zero. Test loss is decent because memorization helps with similar test samples.
Dataset ≈ Capacity
The danger zone. The model can't memorize everything but hasn't learned to generalize efficiently yet. It's stretched thin, doing neither well. Test loss spikes.
Dataset >> Capacity
The model is forced to find efficient, reusable patterns. It generalizes well and test loss drops below the original level. This is where real learning happens.
The Critical Ratio
Double descent begins exactly when the dataset-to-capacity ratio crosses 1.0. For a model with capacity C bits and a dataset of D bits of information: the transition happens at D/C ≈ 1.
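A sketch of locating the transition, assuming the ~3.6 bits/parameter capacity constant reported earlier:

```python
BITS_PER_PARAM = 3.6  # capacity constant reported by the paper (bf16 models)

def dataset_to_capacity_ratio(n_params, dataset_bits):
    """D/C: double descent is predicted to begin where this crosses 1.0."""
    return dataset_bits / (BITS_PER_PARAM * n_params)

# A 1M-parameter model holds ~3.6M bits, so a dataset carrying ~3.6M bits of
# information sits exactly in the danger zone.
ratio = dataset_to_capacity_ratio(1_000_000, 3_600_000)  # = 1.0
```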
Membership Inference
The paper's final major contribution is a scaling law for membership inference: an equation that predicts exactly when privacy attacks will succeed based on model capacity and dataset size. Like other scaling laws, it lets researchers predict the behavior of larger systems from smaller experiments.
The Scaling Law
For a fixed model capacity, membership inference follows a sigmoidal (S-shaped) curve with respect to dataset size: attack success transitions from near-perfect (F1 = 1.0) for tiny datasets to random guessing (F1 = 0.5) for very large datasets, with a smooth transition in between.
The fitted curve applies the sigmoid function σ to the capacity-to-dataset-size ratio, with fitted constants c1 = 1.34, c2 = −0.034, and c3 = −33.14.
The intuitive reading:
- Large model, tiny dataset → Capacity/|D| is huge → F1 near 1.0 (attack easily succeeds)
- Small model, huge dataset → Capacity/|D| is tiny → F1 near 0.5 (attack is random guessing)
- The transition between these regimes is smooth and predictable
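A sketch of that S-shaped relationship. The shape parameters a and b below are illustrative placeholders, not the paper's fitted constants; only the endpoints (0.5 and 1.0) come from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f1_sketch(capacity_to_data_ratio, a=2.0, b=0.0):
    """Illustrative curve from F1=0.5 (random) up to F1=1.0 (perfect).
    a and b are hypothetical shape parameters, not the fitted c1, c2, c3."""
    return 0.5 + 0.5 * sigmoid(a * math.log10(capacity_to_data_ratio) + b)

low = f1_sketch(1e-3)   # small model, huge dataset: near 0.5
high = f1_sketch(1e3)   # large model, tiny dataset: near 1.0
```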
Validation on Larger Models
The researchers validated their scaling law by training GPT-2 Medium (124M params) and GPT-2 XL (1.5B params) on dataset sizes predicted to give specific F1 scores:
| Model | Parameters | Dataset Size | Predicted F1 | Observed F1 |
|---|---|---|---|---|
| GPT-2 XL | 1.56B | 170.7M | 0.55 | 0.546 ± 0.013 |
| GPT-2 XL | 1.56B | 76.8M | 0.75 | 0.711 ± 0.004 |
| GPT-2 XL | 1.56B | 18.9M | 0.95 | 0.959 ± 0.008 |
| GPT-2 Med | 124M | 13.6M | 0.55 | 0.534 ± 0.011 |
| GPT-2 Med | 124M | 6.1M | 0.75 | 0.657 ± 0.006 |
| GPT-2 Med | 124M | 1.5M | 0.95 | 0.980 ± 0.003 |
Predictions were generally within 1–2 percentage points of actual values. The largest discrepancy was at the predicted F1 of 0.75, where the sigmoid is steepest (small changes in the ratio produce large changes in F1).
Implications for Modern LLMs
Contemporary language models are trained with a tokens-per-parameter ratio of 100x or more (e.g., Llama 3 with 8B parameters was trained on 15 trillion tokens). According to this scaling law, that puts their predicted membership inference F1 at essentially 0.5 — random guessing.
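Plugging in the Llama 3 figures quoted above:

```python
tokens = 15e12  # 15 trillion training tokens
params = 8e9    # 8 billion parameters
tokens_per_param = tokens / params  # = 1875, far beyond the ~100x threshold
```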
Results
Key Findings Summary
| Finding | Value | Significance |
|---|---|---|
| Bits-per-parameter (bf16) | 3.51 ± 0.1 (avg) / ~3.6 (large models) | Universal capacity constant for GPT-style models |
| Bits-per-parameter (fp32) | 3.83 ± 0.1 | Only 9% increase despite 2x precision |
| Double descent onset | Dataset/Capacity ≈ 1 | First precise prediction of the transition point |
| Scaling law accuracy | ± 1.5% F1 | Predictions match observations on 125M–1.5B param models |
| Modern LLM MI vulnerability | F1 ≈ 0.5 | Attacks are no better than random guessing |
Extraction vs. Membership Inference
An interesting finding: membership inference is strictly easier than data extraction. In some cases, membership inference achieved F1 of 0.97 while the extraction rate for the same model-dataset pair was 0.0. This makes sense — detecting a statistical trace of training data is easier than reproducing it verbatim.
For very small training sets with 32-token prefixes, 100% of sequences were extractable. But as datasets grow, extraction converges to the test set extraction rate — meaning all remaining "extraction" is just generalization (the model can produce the text because it learned the underlying patterns, not because it memorized that specific sample).
Deduplication Matters
The researchers found that careful deduplication (removing duplicate or near-duplicate entries from a dataset) was "extremely important for faithfully measuring extraction rates," since duplicated data is artificially easier to extract. When sequences are truncated to 64 tokens, 1–2% become duplicates — enough to significantly bias results. They performed an additional deduplication step on top of FineWeb's existing deduplication.
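A minimal sketch of the kind of exact-duplicate filtering described here, using truncated prefixes as keys (the paper's actual pipeline is more involved):

```python
def dedupe(sequences, window=64):
    """Keep only the first occurrence of each truncated sequence."""
    seen, kept = set(), []
    for seq in sequences:
        key = tuple(seq[:window])  # sequences identical up to `window` tokens collide
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept

docs = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 9], [7, 8, 9, 10, 11]]
unique = dedupe(docs, window=4)  # the first two collide on their 4-token prefix
```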
Why This Paper Matters
For Practitioners and Builders
If you're building systems with language models, this paper gives you a practical rule of thumb: a model can store about 3.6 bits per parameter. That's roughly 0.45 bytes per parameter, or about 430 MB for a 1B-parameter model. This means:
- Fine-tuning budget: If you're fine-tuning a model on proprietary data, you now know roughly how much it can absorb. A 7B model can memorize about 3 GB of information — but in practice, you want the model to generalize, not memorize, so your dataset should be significantly larger than this capacity.
- Privacy engineering: The scaling law provides a quantitative framework for assessing whether your training data is at risk. If your data-to-capacity ratio is above ~100x, membership inference is essentially impossible on the average sample.
- Model sizing: If you need a model that memorizes specific knowledge (like a knowledge base), this paper tells you exactly how many parameters you need per bit of knowledge.
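The rule of thumb above, as a quick calculator (assumes the ~3.6 bits/parameter constant):

```python
BITS_PER_PARAM = 3.6  # the paper's capacity constant

def capacity_bytes(n_params):
    """Approximate raw memorization capacity in bytes."""
    return n_params * BITS_PER_PARAM / 8

one_b = capacity_bytes(1e9) / 1e9    # ~0.45 GB for a 1B model
seven_b = capacity_bytes(7e9) / 1e9  # ~3.15 GB for a 7B model
```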
For the Research Community
This paper makes several contributions that change how we think about language models:
- A rigorous definition of memorization. Previous definitions (can the model generate it? can an attacker extract it?) conflated memorization with generalization. This paper's compression-based definition cleanly separates the two using Kolmogorov complexity.
- A clean explanation of double descent. Instead of hand-waving about "interpolation thresholds," the paper shows double descent begins at a precise, measurable point: when dataset information exceeds model capacity in bits.
- Quantitative privacy analysis. The scaling law moves privacy discussions from qualitative ("is it possible to extract data?") to quantitative ("the F1 score at this model/data ratio is exactly X").
The Bigger Picture
This paper sits at the intersection of information theory, privacy, and deep learning scaling laws — three fields that are converging as AI systems grow. The key insight — that neural networks have a measurable, finite capacity for storing information — has implications far beyond this paper:
- Scaling laws meet information theory. The ~3.6 bits-per-parameter constant suggests there's a fundamental limit to how efficiently gradient-descent-trained networks can use their parameters for storage. Future architectures might push this limit higher.
- The memorization-generalization spectrum. This paper reframes the classic machine learning debate as a capacity allocation problem: models have finite bits, and they choose (via gradient descent) how to spend them between memorizing specifics and learning patterns.
- Data governance at scale. As regulations like GDPR require understanding what models "know" about individuals, having a rigorous, quantitative measure of per-sample memorization becomes legally and ethically important.