LoRA: Low-Rank Adaptation of Large Language Models — The Paper, Explained
A beginner-friendly guide to parameter-efficient fine-tuning. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
Large language models like GPT-3 are incredibly powerful, but adapting them to specific tasks has a massive problem: fine-tuning all 175 billion parameters is prohibitively expensive. You need enormous GPU memory, you create a separate 175B-parameter copy for every task, and training takes days. In 2021, this was the central bottleneck of applied NLP.
Here's what the paper solves:
- Storage explosion — Full fine-tuning of GPT-3 creates a separate 350 GB checkpoint per task. With LoRA, each task adds only ~35 MB of new parameters — a 10,000x reduction.
- GPU memory bottleneck — Full fine-tuning needs to store gradients and optimizer states for all parameters. LoRA reduces GPU memory requirements by up to 3x because only the tiny adapter matrices are trained.
- Inference latency — Unlike prior methods such as adapter layers (small neural network modules inserted between existing Transformer layers, which add extra computation at every forward pass and therefore increase inference latency, the time the model takes to generate each token), LoRA adds zero additional latency at inference time because its matrices can be merged directly back into the original weights.
- Task switching — Swapping between tasks only requires loading a small set of LoRA weights (~35 MB) rather than an entire model checkpoint (~350 GB).
Background Concepts
Transformers and Self-Attention
The Transformer is the architecture behind virtually every modern language model (GPT, BERT, and their descendants). At its heart is the self-attention mechanism, in which each token computes relevance scores against every other token and then takes a weighted combination of their representations. This lets each word in the input "look at" all other words, capturing long-range dependencies and building up contextual understanding.
Deep dive: How self-attention uses weight matrices
Each attention layer has four key weight matrices:
- Wq (Query projection) — Transforms each token into a "query" vector: "What am I looking for?"
- Wk (Key projection) — Transforms each token into a "key" vector: "What do I contain?"
- Wv (Value projection) — Transforms each token into a "value" vector: "What information should I pass along?"
- Wo (Output projection) — Combines the attention results back into the model's main representation
Each of these matrices is large — for GPT-3, they are 12,288 × 12,288 (over 150 million parameters each). A Transformer stacks many such layers (96 in GPT-3), and each layer has its own set of these matrices.
Fine-Tuning
Fine-tuning is how you take a general-purpose pretrained model (one already trained on a huge general dataset) and specialize it for a specific task. You take a model that already knows language (from pretraining on trillions of words) and continue training it on smaller, task-specific data, such as customer support conversations, medical records, or legal documents, adapting its general knowledge to a task like sentiment analysis or summarization.
Full Fine-Tuning
Update all parameters in the model. For GPT-3 (175B parameters), this means storing and updating 175 billion weights, their gradients, and optimizer states — requiring many high-end GPUs and creating a full model copy per task.
Parameter-Efficient Fine-Tuning
Only update a small subset of parameters or add small new modules, keeping most of the model frozen. LoRA falls in this category, reducing trainable parameters by 10,000x while preserving quality.
Weight Matrices
A neural network is fundamentally a stack of weight matrices: large two-dimensional grids of numbers that define the linear transformations at each layer. When the model processes text, it multiplies input vectors by these weight matrices at every layer. In GPT-3, a single weight matrix in the attention mechanism has dimensions 12,288 × 12,288, which means over 150 million numbers in just one matrix. The entire model has 96 layers, each with multiple such matrices.
Matrix Rank
Matrix rank is a fundamental concept from linear algebra. It measures how much independent information a matrix actually contains: a matrix of rank r has rows (or columns) that span only an r-dimensional subspace. A 12,288 × 12,288 matrix could theoretically require all 12,288 dimensions to describe, but if most of its information is redundant, its effective "intrinsic" rank might be just 4 or even 1.
A low-rank matrix (one whose rank is much smaller than its dimensions) can be expressed as the product of two much smaller matrices. A rank-r matrix of size d × d can be written as B × A, where B is d × r and A is r × d. When r is much smaller than d, this is a massive compression.
Example: 12,288 × 12,288 = 150M params → (12,288 × 4) + (4 × 12,288) = 98,304 params
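A quick numpy sketch (my own illustration, with d and r matching the GPT-3 example above) makes the compression concrete:

```python
import numpy as np

# Illustrative numbers: GPT-3's hidden size and a LoRA rank of 4.
d, r = 12_288, 4
full_params = d * d                  # a dense d x d update
low_rank_params = d * r + r * d      # B (d x r) plus A (r x d)

print(full_params)       # 150994944
print(low_rank_params)   # 98304

# At toy scale: the product of a (d x r) and an (r x d) matrix
# can never exceed rank r.
rng = np.random.default_rng(0)
B = rng.standard_normal((64, 4))
A = rng.standard_normal((4, 64))
assert np.linalg.matrix_rank(B @ A) <= 4
```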
Adapter Layers
Adapter layers (Houlsby et al., 2019) were one of the first parameter-efficient approaches. They insert small bottleneck modules between existing Transformer layers: project the hidden state down to a small dimension, apply a nonlinearity, and project back up.
The problem: adapter layers add extra sequential computation at every layer during inference. Even a small adapter introduces measurable latency, which is unacceptable for latency-sensitive production deployments. The paper measures that adapters in GPT-2 can increase latency by 20-30% even with very small bottleneck dimensions.
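The bottleneck idea can be sketched in a few lines of numpy (the shapes and the ReLU are my simplifications; the original adapter uses a different nonlinearity):

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    # Bottleneck: project down, apply a nonlinearity, project up,
    # then add a residual connection. This runs *sequentially* after
    # each sublayer, which is where the extra inference latency comes from.
    z = np.maximum(0.0, h @ W_down)   # ReLU stand-in for the nonlinearity
    return h + z @ W_up

d, m = 768, 64                        # hidden size, bottleneck size (illustrative)
rng = np.random.default_rng(0)
h = rng.standard_normal((1, d))
out = adapter_forward(h,
                      0.01 * rng.standard_normal((d, m)),
                      0.01 * rng.standard_normal((m, d)))
assert out.shape == (1, d)
```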
Prefix Tuning
Prefix tuning (Li and Liang, 2021) takes a different approach: instead of modifying weights, it prepends trainable "virtual tokens" to the input at every layer. These soft prompts are optimized during training and steer the model's behavior without changing its weights.
The problem: prefix tuning is hard to optimize (the paper observes its performance changes non-monotonically as the number of trainable parameters grows) and the virtual tokens consume part of the model's limited context window (the maximum number of tokens the model can process at once: 2,048 for GPT-3 in 2021), reducing the sequence length available for actual inputs.
Deep dive: Comparison of parameter-efficient methods
Adapter Layers
Approach: Insert bottleneck modules between layers
Pros: Works well, conceptually simple
Cons: Adds inference latency (sequential computation), hard to parallelize
Prefix Tuning
Approach: Prepend learnable virtual tokens at each layer
Pros: No weight modification
Cons: Hard to optimize, reduces usable context length, performance plateaus
LoRA
Approach: Add low-rank update matrices in parallel to existing weights
Pros: Zero inference latency, easy to optimize, small storage, matches full fine-tuning
Cons: Requires choosing which weight matrices to adapt and what rank to use
How It Works
The Core Idea: Low-Rank Decomposition
LoRA's central mechanism is simple and elegant. For a pretrained weight matrix W (of size d × d), instead of updating W directly during fine-tuning, LoRA freezes W and adds a parallel low-rank update:

W' = W + ΔW = W + B × A

where B is a d × r matrix and A is an r × d matrix, with the rank r much smaller than d. In LoRA, r is a hyperparameter that controls the expressiveness of the adaptation: a rank-r update can capture r independent "directions" of change, and the paper shows that even r = 1 can suffice for many tasks, despite the weight matrices being 12,288-dimensional. The product B × A gives you a full d × d matrix, but with only r × (d + d) = 2dr trainable parameters instead of d².
Architecture: The LoRA Reparametrization

In the paper's diagram, the input x passes through the frozen W and, in parallel, through A and then B; the branch output is scaled by α/r before being added to W · x. B is d × r and initialized to zero, so ΔW = 0 at the start.
Two important initialization details:
- A is initialized randomly (Gaussian) so the low-rank space starts with diverse directions
- B is initialized to zero, which means ΔW = BA = 0 at the start of training — the model starts from the exact pretrained behavior and gradually learns the task-specific adjustments
The output of the low-rank branch is also scaled by α/r, where α is a constant scaling factor, typically set to the first value of r tried. Because this keeps the effective scale roughly constant, you can change r without re-tuning the learning rate.
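Putting the initialization and scaling together, here is a minimal numpy sketch of the LoRA forward pass (variable names are mine; a toy size stands in for GPT-3's 12,288):

```python
import numpy as np

d, r, alpha = 256, 4, 4                   # alpha set to the first r tried
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))           # frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d))    # Gaussian init: diverse directions
B = np.zeros((d, r))                      # zero init: delta_W = B A = 0 at start

def lora_forward(x):
    # h = W x + (alpha / r) * B (A x): the low-rank branch runs in parallel.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# Before any training, the adapted model reproduces the pretrained output exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B would receive gradients during training; W stays frozen.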
Weight Merging: Zero-Cost Inference
This is LoRA's killer feature, and what separates it from adapter layers. After training, you can merge the LoRA matrices back into the original weights:

W' = W + (α/r) × B × A
The result is a single weight matrix W' that has the exact same dimensions as the original W. During inference, the model uses W' directly — there is no extra computation, no extra memory, no extra latency. The adapted model looks and runs identically to the original architecture.
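The merge can be checked at toy scale (my own sketch, not the paper's code): fold the scaled product into W and confirm the merged and unmerged forward passes agree.

```python
import numpy as np

d, r, alpha = 256, 8, 8
rng = np.random.default_rng(1)
W = rng.standard_normal((d, d))
A = 0.02 * rng.standard_normal((r, d))    # pretend these were trained
B = 0.02 * rng.standard_normal((d, r))

W_merged = W + (alpha / r) * (B @ A)      # same d x d shape: no extra latency

x = rng.standard_normal(d)
unmerged = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(W_merged @ x, unmerged)

# Task switching: subtract the adapter back out to recover the base weights,
# then add a different task's B' A'.
W_base = W_merged - (alpha / r) * (B @ A)
assert np.allclose(W_base, W)
```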
Applying LoRA to the Transformer
In a standard Transformer layer, there are four weight matrices in the self-attention module (Wq, Wk, Wv, Wo) and two (W1, W2) in the feed-forward network, the MLP block of each layer: two large weight matrices with a nonlinear activation in between, which in GPT-3 are 12,288 × 49,152 and 49,152 × 12,288, even larger than the attention matrices. LoRA can be applied to any subset of these.
The paper's experiments focus on attention weights only and find that adapting the right combination of attention matrices is sufficient to match full fine-tuning. They do not apply LoRA to the feed-forward (MLP) layers in their main experiments (though they note this could provide further gains).
Deep dive: Parameter count comparison for GPT-3
GPT-3 175B has 96 layers. Each layer has attention projection matrices of size 12,288 × 12,288. Applying LoRA with rank r=4 to Wq and Wv:
- Per matrix: A is 4 × 12,288 = 49,152 params, B is 12,288 × 4 = 49,152 params → 98,304 total per matrix
- Per layer: 2 matrices × 98,304 = 196,608 params
- All layers: 96 layers × 196,608 = ~18.9M trainable params
- Ratio: 18.9M / 175,000M = 0.01% of the original model
With rank r=1, this drops even further: ~4.7M trainable parameters, or 0.0027% of the model.
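The same arithmetic as a short script (assuming, per the paper's default, LoRA on Wq and Wv in every layer):

```python
d_model = 12_288        # GPT-3 hidden size
n_layers = 96
matrices_per_layer = 2  # Wq and Wv

def lora_param_count(r):
    per_matrix = d_model * r + r * d_model  # B (d x r) + A (r x d)
    return n_layers * matrices_per_layer * per_matrix

print(lora_param_count(4))                   # 18874368  (~18.9M)
print(lora_param_count(1))                   # 4718592   (~4.7M)
print(f"{lora_param_count(4) / 175e9:.4%}")  # roughly 0.01% of 175B
```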
Which Weights to Adapt
Not all weight matrices benefit equally from LoRA. The paper systematically tests which attention projection matrices to adapt, using GPT-3 175B on the WikiSQL and MultiNLI benchmarks. All configurations use the same budget of 18.9M trainable parameters, so the rank shrinks as more matrices are adapted: r = 8 for a single matrix, r = 4 for a pair, and r = 2 for all four.
| Weight Matrices | Rank r | Trainable Params | WikiSQL Acc. | MultiNLI Acc. |
|---|---|---|---|---|
| Wq only | 8 | 18.9M | 70.4% | 91.0% |
| Wk only | 8 | 18.9M | 70.0% | 90.7% |
| Wv only | 8 | 18.9M | 73.0% | 91.1% |
| Wo only | 8 | 18.9M | 73.2% | 91.3% |
| Wq + Wk | 4 | 18.9M | 71.4% | 91.0% |
| Wq + Wv | 4 | 18.9M | 73.7% | 91.3% |
| Wq + Wk + Wv + Wo | 2 | 18.9M | 73.7% | 91.2% |
The pattern: spreading a fixed parameter budget across more matrices beats concentrating it in one. That adapting all four matrices at rank 2 performs comparably to Wq + Wv at rank 4 reinforces the point: breadth of adaptation matters more than depth. The paper recommends Wq + Wv as the default choice because it offers the best balance of simplicity and performance.
What Rank to Use
Perhaps the most surprising finding in the paper: extremely low ranks work remarkably well. The authors test ranks from 1 to 64 on GPT-3 175B:
| Rank r | Trainable Params | WikiSQL Acc. | MultiNLI Acc. |
|---|---|---|---|
| 1 | 4.7M | 73.4% | 91.2% |
| 2 | 9.4M | 73.3% | 91.7% |
| 4 | 18.9M | 73.7% | 91.3% |
| 8 | 37.7M | 73.7% | 91.0% |
| 64 | 301.9M | 73.6% | 91.4% |
Even rank 1 — meaning each LoRA matrix pair has just two vectors (B is d × 1, A is 1 × d) — achieves performance within 0.3% of rank 64, which has 64x more parameters. Going beyond rank 4 provides essentially no benefit.
Deep dive: Subspace similarity analysis
The authors go deeper by analyzing the actual learned LoRA matrices. They compare the subspaces learned at different ranks by measuring the Grassmann distance (a measure of the "angle" between two subspaces: 0 if they are identical, maximal if they are completely orthogonal and share no common directions) between the top singular vectors of ΔW at rank 8 versus rank 64.
Key finding: the top-1 direction of ΔW at rank 8 is contained within the top few directions of ΔW at rank 64. This means smaller ranks capture the most important direction of adaptation, and larger ranks mostly add noise-like directions that don't help performance.
Furthermore, the adaptation matrix ΔW amplifies directions that are not already emphasized by the pretrained weight W. The top singular directions of ΔW have only 0.32% overlap with the top singular directions of W — suggesting LoRA learns genuinely new task-specific features rather than simply scaling up what the model already knows.
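This kind of comparison can be illustrated with numpy (a toy stand-in using synthetic matrices, not the paper's data): compare the top left-singular subspaces of two matrices that share a dominant rank-1 component.

```python
import numpy as np

def subspace_similarity(M1, M2, i, j):
    # Normalized overlap between the top-i and top-j left-singular
    # subspaces, in the spirit of the paper's Grassmann-based measure:
    # phi = ||U1[:, :i].T @ U2[:, :j]||_F^2 / min(i, j), in [0, 1].
    U1 = np.linalg.svd(M1)[0][:, :i]
    U2 = np.linalg.svd(M2)[0][:, :j]
    return np.linalg.norm(U1.T @ U2) ** 2 / min(i, j)

rng = np.random.default_rng(0)
shared = np.outer(rng.standard_normal(64), rng.standard_normal(64))
low_rank_dW  = shared + 0.01 * rng.standard_normal((64, 64))  # stand-in for rank-8 delta_W
high_rank_dW = shared + 0.01 * rng.standard_normal((64, 64))  # stand-in for rank-64 delta_W

# The dominant direction is common to both, so the top-1 subspace of one
# lies almost entirely inside the top-4 subspace of the other.
assert subspace_similarity(low_rank_dW, high_rank_dW, 1, 4) > 0.9
```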
Results
RoBERTa and DeBERTa (GLUE Benchmark)
The authors test LoRA on the GLUE benchmark (General Language Understanding Evaluation), a standard suite of nine natural language understanding tasks including sentiment analysis (SST-2), textual entailment (MNLI, RTE), and sentence similarity (MRPC, STS-B). They compare against full fine-tuning and other parameter-efficient methods using RoBERTa (Liu et al., 2019; a robustly optimized version of BERT, in 125M and 355M parameter versions) and DeBERTa (He et al., 2021; the 1.5B-parameter XXL model, which was state-of-the-art on many NLU benchmarks at the time thanks to a disentangled attention mechanism that separates content and position information).
| Method | Model | # Trainable Params | MNLI (Acc.) | SST-2 (Acc.) | MRPC (Acc.) | Avg. GLUE |
|---|---|---|---|---|---|---|
| Full Fine-Tuning | RoBERTa-base | 125M (100%) | 87.6 | 94.8 | 90.2 | 86.4 |
| Adapter (Houlsby) | RoBERTa-base | 0.9M (0.7%) | 87.1 | 94.7 | 89.5 | 85.8 |
| LoRA (r=8) | RoBERTa-base | 0.3M (0.2%) | 87.5 | 95.1 | 89.7 | 86.2 |
| Full Fine-Tuning | RoBERTa-large | 355M (100%) | 90.2 | 96.4 | 90.9 | 89.0 |
| Adapter (Houlsby) | RoBERTa-large | 0.8M (0.2%) | 90.2 | 96.2 | 90.2 | 88.7 |
| LoRA (r=8) | RoBERTa-large | 0.8M (0.2%) | 90.6 | 96.2 | 90.6 | 89.2 |
| Full Fine-Tuning | DeBERTa XXL | 1.5B (100%) | 91.7 | 97.2 | 92.0 | 90.8 |
| LoRA (r=8) | DeBERTa XXL | 4.7M (0.3%) | 91.9 | 97.5 | 92.6 | 91.1 |
GPT-3 175B Results
The most striking results come from GPT-3, where full fine-tuning requires immense computational resources. LoRA makes fine-tuning GPT-3 practical on a single machine:
| Method | # Trainable Params | WikiSQL (Acc.) | MNLI-m (Acc.) | SAMSum (R1/R2/RL) |
|---|---|---|---|---|
| GPT-3 (few-shot, no FT) | 0 | 73.8 | 50.4 | 52.0/28.0/44.5 |
| Full Fine-Tuning | 175,255.8M | 73.8 | 89.5 | 52.0/28.0/44.5 |
| BitFit | 14.2M | 71.3 | 84.7 | 51.3/27.1/43.5 |
| Prefix-Embedding | 3.2M | 63.1 | 88.6 | 49.2/25.7/41.7 |
| Prefix-Layer | 20.2M | 70.1 | 89.4 | 50.8/27.4/43.5 |
| Adapter (Houlsby) | 7.1M | 71.9 | 89.8 | 52.3/28.1/44.5 |
| Adapter (Lin) | 40.1M | 73.2 | 91.5 | 53.2/28.6/44.8 |
| LoRA (r=4) | 4.7M | 73.4 | 91.7 | 53.8/29.8/45.9 |
LoRA with just 4.7 million trainable parameters (0.0027% of the model) outperforms full fine-tuning of all 175 billion parameters on MNLI-m and SAMSum, and comes within 0.4 points of it on WikiSQL. It also beats adapters that use roughly 8x more trainable parameters, and prefix-based methods by a wide margin.
Efficiency Comparison
| Metric | Full FT | Adapter | Prefix Tuning | LoRA |
|---|---|---|---|---|
| Trainable parameters (GPT-3) | 175B (100%) | 7.1-40M | 3.2-20M | 4.7M (0.0027%) |
| GPU memory (training) | Baseline | Baseline | Baseline | ~1/3 of baseline (up to 3x less) |
| Checkpoint size (GPT-3) | ~350 GB | ~35 MB | ~20 MB | ~35 MB |
| Additional inference latency | None | +20-30% | Slight | None (merged) |
| Training speed (GPT-3) | Baseline | Slower | Similar | 25% faster |
Why This Paper Matters
For Practitioners
LoRA democratized fine-tuning of large language models. Before this paper, adapting GPT-3-scale models required enormous compute budgets available only to large companies. After LoRA, anyone with a single GPU could fine-tune a large model by training just the low-rank adapters. This unlocked a wave of customized LLMs for specific domains — medicine, law, code, customer support — built by small teams and individuals. Today, LoRA and its variants (QLoRA, DoRA) are the default approach for fine-tuning open-source LLMs like Llama, Mistral, and Qwen.
For the Research Community
The paper provided strong empirical evidence for a deep theoretical insight: adaptation lives in a low-dimensional subspace. Despite models having billions of parameters, the actual "direction" of change needed for a new task can be captured in just a few dimensions. This inspired an explosion of follow-up work in parameter-efficient fine-tuning, model merging, continual learning, and our understanding of what pretrained models actually learn. The concept of "intrinsic dimensionality" of tasks has become a foundational idea in modern deep learning research.
The Bigger Picture
LoRA changed the economics of AI customization. The traditional approach of "fine-tune a separate copy of the full model for each use case" was a dead end as models grew to hundreds of billions of parameters. LoRA showed that customization can scale independently of model size — a 35 MB adapter file works for a 350 GB model. This insight laid the groundwork for today's ecosystem where a single foundation model serves as a shared base, with thousands of lightweight task-specific adapters that can be loaded, swapped, and combined at runtime. It made the "one model, many specializations" paradigm practical, and fundamentally reshaped how the industry deploys large language models.