Avatar V: The Paper, Explained
A beginner-friendly guide to HeyGen's Avatar V technical report. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
Avatar V is a system that takes a short video of a real person and generates new, high-quality talking-head videos of that person saying anything you want. The generated person doesn't just look like the original - they move, talk, and gesture like them too.
The Three Problems It Solves
Before Avatar V, existing systems had three big weaknesses:
- Shallow identity: They used a single photo as reference. One photo can't capture how you look from different angles, in different lighting, or with different expressions. So generated videos would "drift" - the person would start looking less like you over time.
- No personality: They could copy your face but not your behavior. Everyone's generated videos looked the same in terms of motion - generic head bobs and lip movements.
- Blurry faces: The AI spreads its learning effort evenly across the entire video frame. But the face (especially lips, teeth, and eyes) is what humans actually care about, and it's a tiny portion of the frame. So faces came out blurry or wrong.
See It In Action
Before diving into the technical details, see what Avatar V actually produces. These demos are from the official research page.
Reference Video vs Generated Output
Given a short reference video of someone (left), Avatar V generates a new video of that person in a different scene, preserving their identity and talking style (right).
Comparison with Other Models
Avatar V is evaluated against Kling O3 Pro, Veo 3.1, OmniHuman 1.5, and Seedance 2.0. These grid comparisons show all models generating from the same inputs.
Can You Tell Real from AI? (Turing Test)
In the paper's Turing test, human annotators were shown a pair of videos — one real, one generated — and asked to pick the real one.
In the paper's evaluation, 61% of test cases fooled at least one trained annotator. More Turing test pairs are on the project page.
Background Concepts You Need
The paper assumes you know these AI concepts. Let's build them up from scratch.
Diffusion Models
A diffusion model is the engine behind most modern image and video generation (DALL-E, Stable Diffusion, Sora, etc.). It generates images and video by learning to reverse a noise-adding process: during training, noise is added to real data; at inference, the model starts from pure noise and progressively removes it to create new data.
Two phases:
- Forward process (training): Take real data, gradually add random noise until it becomes pure static.
- Reverse process (generation): Start from pure noise, remove noise step-by-step. Each step, the model predicts "what noise is here?" and subtracts it.
The number of noise-removal passes is called the number of denoising steps - how many times the model looks at the noisy image and removes a bit of noise, typically anywhere from 20 to 1000. More steps = better quality, but slower. Avatar V uses 24 steps after optimization (down from hundreds).
Deep Dive: What is "noise" mathematically?
"Noise" here means Gaussian noise - random values drawn from a bell curve (normal distribution). Each pixel gets a random value added to it. At step 0, the image is clean. At the final step, it's completely random static - no trace of the original image remains.
The model is trained to predict the noise that was added at each step. During generation, it predicts the noise in the current noisy image and subtracts it, getting slightly closer to a clean image each time.
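A minimal numpy sketch of the forward process, using a simplified linear blending schedule (real diffusion schedules are more elaborate). The point it demonstrates: correlation with the original data fades as noise accumulates, until no trace remains.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)   # toy "clean image" (1-D for simplicity)

def add_noise(x0, t, T, rng):
    """Blend clean data with Gaussian noise; t=0 is clean, t=T is pure static."""
    alpha = 1.0 - t / T                      # how much of the original survives
    eps = rng.standard_normal(x0.shape)      # fresh Gaussian noise
    return alpha * x0 + (1.0 - alpha**2) ** 0.5 * eps

T = 10
corr_early = abs(np.corrcoef(x0, add_noise(x0, 1, T, rng))[0, 1])
corr_late = abs(np.corrcoef(x0, add_noise(x0, 9, T, rng))[0, 1])
print(corr_early, corr_late)   # the original signal fades as t grows
```

A trained diffusion model runs this in reverse: at each step it predicts the noise component and subtracts it.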
Transformers & Attention
A Transformer is the architecture (blueprint) that has dominated AI since 2017 - it powers ChatGPT, DALL-E, and now Avatar V. Its superpower is the attention mechanism: a way for the model to decide which parts of the input are relevant to the part it's currently processing. Each element computes a relevance score with every other element, then focuses on the most relevant ones.
Key terms the paper uses:
- DiT (Diffusion Transformer) - Avatar V's core architecture: a Transformer designed specifically for diffusion. Instead of a U-Net (the older approach), it uses Transformer blocks to process the noisy image/video, which is more scalable and powerful at large sizes.
- Self-attention - Elements in the same sequence look at each other to gather context: "How does this video frame relate to the other frames in this video?"
- Cross-attention - Elements from one input look at elements from a different input: for example, video frames "looking at" audio features to synchronize lip movements with speech sounds.
- Tokens - The basic units the model works with. Text gets split into word-piece tokens, images into patch tokens, and video into space-time patch tokens. Everything becomes a sequence of tokens.
Deep Dive: Why "quadratic cost" matters
In standard attention, every token looks at every other token. If you have N tokens, that's N x N comparisons - this is quadratic growth. Double the tokens = 4x the computation.
A reference video might have thousands of tokens. If you naively let all reference + generation tokens attend to each other, the cost explodes. This is the problem Avatar V's "Sparse Reference Attention" solves.
Deep Dive: KV Cache
In attention, each token produces three things: a Query (Q), a Key (K), and a Value (V).
- Query: "I'm looking for information about X"
- Key: "I contain information about Y"
- Value: "Here's my actual content"
The attention score is Q matched against K (like a search query hitting search results). High-scoring matches have their V content sent back.
KV Caching: If the reference video never changes between denoising steps, you can compute its Keys and Values once, cache them, and reuse them for all 24 steps. This is a huge speed win.
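A toy numpy sketch of KV caching (the token counts and projection matrices are made up, and real models use many heads and batches): the reference's Keys and Values are projected once, then reused across all 24 denoising steps.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no batching)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 64
ref_tokens = rng.standard_normal((200, d))                    # reference tokens: fixed
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy projections

# The reference never changes between denoising steps, so project it ONCE.
ref_K, ref_V = ref_tokens @ Wk, ref_tokens @ Wv               # <-- the KV cache

for step in range(24):                                        # 24 steps, per the paper
    gen_tokens = rng.standard_normal((150, d))                # changes every step
    out = attention(gen_tokens @ Wq, ref_K, ref_V)            # reuse cached K/V
print(out.shape)
```

Without the cache, the two projection matmuls over the reference would be repeated 24 times for zero benefit.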
VAE (Variational Autoencoder)
A VAE (Variational Autoencoder) compresses high-resolution images/video into a smaller "latent" representation and decompresses it back - like JPEG compression, but learned by AI and much more powerful.
Why this matters: Working with full 1080p video frames directly would be absurdly expensive computationally. Instead, the VAE compresses each frame into a tiny "latent" version, the diffusion model works in this compact space, and then the VAE decompresses the result back to full-size video.
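Some back-of-envelope arithmetic makes the saving concrete. The compression factors below (8x per spatial axis, 4x temporal, 16 latent channels) are typical of modern video VAEs, not figures the Avatar V report confirms.

```python
# Why diffusion runs in latent space: count the values the model must process.
frames, height, width, channels = 24, 1080, 1920, 3
pixel_values = frames * height * width * channels

# Hypothetical compression: 8x per spatial axis, 4x temporal, 16 latent channels.
latent_values = (frames // 4) * (height // 8) * (width // 8) * 16

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values")
print(f"compression:  {pixel_values // latent_values}x")
```

Every downstream cost (attention, memory, denoising steps) shrinks with this ratio, which is why the diffusion model never touches raw pixels.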
Flow Matching
Flow matching is a modern alternative to DDPM (Denoising Diffusion Probabilistic Models), the original 2020 method for training diffusion models, which uses a specific noise schedule and learns to predict the noise added at each step. Flow matching is often more stable and efficient.
Instead of predicting noise, the model learns a velocity field: at each point along the noise-to-data journey, it predicts which direction to go and how fast. Avatar V uses "rectified flow matching," which specifically encourages straight-line paths.
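The straight-line idea fits in a few lines of numpy. This is a generic rectified-flow sketch, not Avatar V's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(512)   # a clean data sample
x1 = rng.standard_normal(512)   # a pure-noise sample

# Rectified flow puts each training pair on a straight line:
#   x_t = (1 - t) * x0 + t * x1
# and the model's regression target is the constant velocity along that line:
#   v = x1 - x0
t = 0.3
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

# Walking back along the velocity from x_t recovers the clean sample exactly:
x_back = x_t - t * v_target
print(np.allclose(x_back, x0))
```

Because the target path is straight, few integration steps are needed at inference, which is what later makes aggressive step-count distillation possible.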
Embeddings
An embedding is a list of numbers (a vector) that represents something complex - a face, a voice, a word - in a compact way that captures its meaning. Similar things have similar number patterns, which makes embeddings useful both for measuring similarity and as input to AI models.
The paper mentions several types:
- Identity embedding: Numbers capturing someone's facial appearance
- Expression embedding: Numbers capturing facial expressions at a moment
- Speaker embedding: Numbers capturing someone's voice characteristics
- Text embedding: Numbers representing the meaning of a text prompt
- ArcFace embedding - A face embedding from a well-known 2019 face recognition model, used as the standard way to measure identity similarity: if the cosine similarity between two ArcFace embeddings is high, the faces look alike
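A toy numpy illustration of how embedding similarity works, with random vectors standing in for real ArcFace outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: 1 = identical direction, near 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
person = rng.standard_normal(512)                       # toy 512-d "face embedding"
same_person = person + 0.1 * rng.standard_normal(512)   # same face, new photo
stranger = rng.standard_normal(512)                     # unrelated face

print(cosine_similarity(person, same_person))   # high: same identity
print(cosine_similarity(person, stranger))      # near zero: different identity
```

This is the mechanism behind the paper's "Face Similarity" metric: embed both faces, compare the vectors.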
How Avatar V Works
Avatar V has four major components working together:
Sparse Reference Attention
This is Avatar V's most important innovation. The core idea: instead of conditioning on a single reference photo, feed the entire reference video into the model as tokens the generation can attend to, so identity is captured across angles, lighting, and expressions.
But there's a cost problem. With standard attention:
- Reference video tokens: let's say 5,000
- Generation video tokens: let's say 5,000
- Standard attention: every token looks at every other = 10,000 x 10,000 = 100 million comparisons
Sparse Reference Attention's trick:
- Generation tokens CAN look at reference tokens (they need identity info)
- Reference tokens only look at OTHER reference tokens (they don't need anything from the generation)
- This makes the cost linear in reference length instead of quadratic
Deep Dive: What "asymmetric" means here
The attention is asymmetric because the two groups of tokens have different attention rules:
- Reference tokens: Self-attention only (look at each other)
- Generation tokens: Attend to BOTH generation tokens AND reference tokens
This asymmetry is what makes it "sparse" - not all possible attention connections exist. The missing connections (reference looking at generation) aren't useful anyway, so removing them is free performance.
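The asymmetric mask can be sketched directly. Token counts here are scaled down from the 5,000s used in the example above, and a real implementation would skip the masked computation rather than build a dense mask:

```python
import numpy as np

n_ref, n_gen = 100, 100          # token counts (scaled-down illustration)
n = n_ref + n_gen                # reference tokens first, generation tokens after

# mask[i, j] = True means token i may attend to token j.
mask = np.zeros((n, n), dtype=bool)
mask[:n_ref, :n_ref] = True      # reference -> reference only
mask[n_ref:, :] = True           # generation -> reference AND generation

full = n * n                     # standard attention: every pair
sparse = int(mask.sum())         # asymmetric: ref->gen links removed
print(full, sparse)              # the saving equals n_ref * n_gen links
```

The removed block (reference attending to generation) is exactly the set of connections the text calls "not useful anyway."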
Motion Representation Stream
This component captures how a person moves, not just how they look.
It serves two roles simultaneously (called "closed-loop"):
- As a learning target: "Given this audio, predict how THIS specific person would move"
- As a conditioning signal: "Use these predicted motions to guide video generation"
By doing both, the model develops a unified understanding of each person's motion style.
Super-Resolution Refiner
The core model generates video at low resolution (for speed). A super-resolution refiner then upscales it to 1080p - generating a higher-resolution version with added detail, like "enhance" in the movies, except it actually works because the AI has learned what realistic detail looks like.
What makes it special: Unlike generic upscaling (which just makes pixels bigger), Avatar V's refiner has access to the same identity reference video. So when it's enhancing the face region, it can look at the reference to know exactly what your teeth, skin pores, and eye details should look like.
It also uses sparse temporal attention: instead of each frame looking at ALL other frames during upscaling (expensive), each frame only looks at nearby frames. Since the base model already established smooth, consistent motion, the refiner only needs to add local detail, not global consistency.
Voice Cloning Engine
From just ~10 seconds of audio, the voice cloning engine can reproduce someone's voice. It's built on an LLM (Large Language Model) backbone - the same kind of architecture as ChatGPT, here applied to audio - and treats speech generation as predicting a sequence of audio tokens: discrete codec codes representing small chunks of sound, predicted one by one the way ChatGPT predicts the next word.
How It Learns (Training Pipeline)
Avatar V doesn't learn everything at once. It follows a 5-stage curriculum, like going from elementary school through grad school:
Stage 1: Text-to-Video Pre-Training
The model first learns general video understanding from millions of text-video pairs: "A dog runs across a field" → video of a dog running. This teaches:
- How objects move through space
- How lighting and physics work
- Basic scene composition
Training uses progressive scaling: start with tiny, short videos, gradually increase resolution and duration. Like teaching a child to draw stick figures before oil paintings.
Optimizer: Muon for most parameters, AdamW for embeddings. (An optimizer is the algorithm that adjusts the model's numbers - its weights - during training. Muon is a newer optimizer, from 2025, that is more efficient than the widely-used Adam for large models; AdamW is the standard Transformer choice - Adam adapts learning rates per parameter, and the "W" adds weight decay, a regularization technique that keeps weights from growing too large.)
Stage 2: Audio-to-Video Pre-Training
Now the model learns to synchronize lips with speech. Given a face image + audio track, generate a video where the person speaks those words. This stage adds the audio cross-attention modules that connect sound features to visual generation.
Trained on a huge corpus of talking-head videos covering diverse speakers, languages, and styles.
Stage 3: Personality SFT (Supervised Fine-Tuning)
SFT (Supervised Fine-Tuning) = taking a pre-trained model and training it further on a specific task with labeled examples - like a medical student (pre-trained on general medicine) specializing in cardiology (fine-tuned on heart cases). Here, the general model is specialized for identity preservation.
The training data is carefully constructed: each example has a target video (what to generate) paired with reference clips of the same person in different scenes. This forces the model to extract identity features that are independent of the background.
This is where Sparse Reference Attention and the motion representation stream are activated.
Human-aware auxiliary losses are added here - extra training signals beyond pixel-level accuracy that specifically target face quality, lip sync, identity similarity, and motion fidelity.
Stage 4: Distillation (Making It 10x Faster)
Distillation compresses the slow, high-quality model into a fast one: a smaller or faster "student" model is trained to mimic the larger, slower "teacher," learning to produce similar outputs in far fewer steps - like a student learning shortcuts from an experienced teacher.
Phase 1: CFG Distillation
Classifier-Free Guidance (CFG) normally runs the model twice per step - once with the conditioning (e.g., "a cat") and once without - and amplifies the difference between the two so the output matches the condition more closely. The problem: this doubles (or more) the computation per step. CFG distillation teaches the model to internalize the guidance, needing only ONE pass.
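A sketch of vanilla CFG with a stand-in model, to make the two-passes-per-step cost concrete. The toy model and guidance scale are purely illustrative:

```python
import numpy as np

def cfg_prediction(model, x, cond, scale=5.0):
    """Vanilla CFG: TWO model calls per denoising step."""
    eps_cond = model(x, cond)      # pass 1: with the condition
    eps_uncond = model(x, None)    # pass 2: without it
    # push the output further in the direction the condition pulls
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy stand-in model: conditioning shifts the prediction by a constant.
def toy_model(x, cond):
    return 0.5 * x + (0.1 if cond is not None else 0.0)

x = np.random.default_rng(0).standard_normal(8)
guided = cfg_prediction(toy_model, x, cond="a talking head")
print(guided)
# A CFG-distilled student is trained so that ONE student call reproduces
# `guided` directly, removing the second pass from every step.
```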
Phase 2: DMD (Distribution Matching Distillation)
This reduces the number of denoising steps. Uses a three-model setup:
- Student: Learns to generate in fewer steps
- Fake teacher: Models what the student's outputs look like (trainable)
- Real teacher: The original slow model (frozen - doesn't change)
The student learns to make its output distribution match the real teacher's, even though it uses far fewer steps.
Combined result: 10x+ faster inference.
Stage 5: RLHF (Learning from Human Preferences)
RLHF (Reinforcement Learning from Human Feedback) = letting humans judge the outputs and training the model to score higher. Humans rate outputs (or compare pairs), and the model learns to produce outputs they prefer - the same recipe used to train ChatGPT to be helpful. Here, it's used to make videos look more natural to human eyes.
Two approaches are combined:
- GRPO (Group Relative Policy Optimization): Generate a group of videos, score them all with reward functions (identity similarity, motion naturalness, visual quality), then improve the model from the relative rankings within the group - more stable than traditional policy gradient methods.
- DPO (Direct Preference Optimization): Learn directly from human-annotated preference pairs ("this video is better than that one") without training a separate reward model - simpler, and often equally effective.
KL regularization prevents the model from drifting too far from its pre-RLHF capabilities. It is a mathematical constraint on how different the new model may be from the old one (measured by KL divergence); without it, the model might "hack" the reward function by producing weird outputs that score high but look terrible.
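GRPO's group-relative scoring fits in a few lines. The reward values below are made up for illustration:

```python
import numpy as np

# GRPO's core trick (sketch): score a GROUP of generations for the same
# input, then learn from rankings relative to the group average.
rewards = np.array([0.62, 0.71, 0.55, 0.80])   # e.g. combined identity/motion/quality scores

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# positive advantage -> better than the group average -> reinforced;
# negative advantage -> worse than average -> discouraged
print(advantages)
```

Because the baseline is the group's own mean, no separate value model is needed, which is part of why the method is stable in practice.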
How It Generates Video (Inference)
Inference is when the trained model actually generates a video. (Training = learning; inference = doing. When you type a prompt into ChatGPT, the response is generated during inference.) Here's how:
Chunk-Based Generation
Avatar V generates video in chunks of ~6.4 seconds each. For longer videos, chunks are stitched together:
- First chunk: Uses the reference video directly to establish identity
- Subsequent chunks: Use the last frames of the previous chunk as a bridge to maintain continuity
- A global appearance anchor from the first chunk keeps identity consistent across all chunks
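The chunking logic above can be sketched as a loop. Everything here - the function names, the list-of-strings "frame" representation - is an illustrative stand-in, not the paper's actual API:

```python
def generate_chunk(reference, audio, bridge=None, anchor=None):
    """Stand-in generator: a chunk is just a list of labeled frames."""
    lead = bridge if bridge is not None else [f"start<{reference}>"]
    return lead + [f"frame<{audio}|anchor={anchor}>"] * 3

def generate_long_video(reference, audio_chunks):
    chunks, anchor, bridge = [], None, None
    for audio in audio_chunks:
        chunk = generate_chunk(reference, audio, bridge=bridge, anchor=anchor)
        if anchor is None:
            anchor = "appearance-of-first-chunk"  # global identity anchor
        bridge = chunk[-1:]                       # tail frames seed the next chunk
        chunks.append(chunk)
    return chunks

video = generate_long_video("ref.mp4", ["audio-0s-6.4s", "audio-6.4s-12.8s"])
# each later chunk starts from the previous chunk's last frame
print(video[1][0] == video[0][-1])
```

The two carried-over pieces play different roles: the bridge keeps motion continuous across the seam, while the anchor keeps identity from drifting over many chunks.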
Speed Optimizations
The paper describes several clever tricks to make generation fast enough for production:
- Context caching: The reference video never changes between denoising steps, so compute it once and reuse it for all 24 steps
- Sequence parallelism: Spread the long token sequence across 8 GPUs using Ulysses Sequence Parallelism - each GPU processes a portion of the sequence, with all-to-all communication when attention needs to see tokens held by other GPUs (the name nods to James Joyce's famously long novel)
- AI-written GPU code: They used an LLM to write optimized low-level GPU programs that fuse many small operations into single large ones, reducing overhead by 3x
- Overlapped communication: GPUs transfer data to each other at the same time as they compute, hiding the communication cost
- GPU clock locking: In distributed inference, the slowest GPU determines speed. They lock all GPUs to a stable frequency to eliminate variance.
Data: Fuel for the Model
Avatar V was trained on a massive dataset: 100M+ clips curated from 50M raw videos.
The Data Pipeline
Raw videos go through a multi-stage filtering cascade.
Cross-Clip Identity Connectivity
A critical data innovation: they build a graph connecting video clips of the same person across different scenes. Two clips are linked if:
- Same person (high face similarity)
- Different scene (low background similarity)
- Long enough to capture motion patterns
This lets the model learn "this is the same person even though the background, lighting, and camera angle are completely different" - essential for identity that doesn't depend on the scene.
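A toy sketch of the linking rule. The embeddings are random stand-ins and the thresholds are illustrative, not values from the paper:

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_link(a, b, face_thresh=0.7, bg_thresh=0.5):
    """Link two clips: same face, different scene."""
    same_person = cosine(a["face"], b["face"]) > face_thresh
    different_scene = cosine(a["bg"], b["bg"]) < bg_thresh
    return same_person and different_scene

rng = np.random.default_rng(0)
alice = rng.standard_normal(128)                 # toy face embedding
clips = [
    {"face": alice + 0.05 * rng.standard_normal(128), "bg": rng.standard_normal(64)},
    {"face": alice + 0.05 * rng.standard_normal(128), "bg": rng.standard_normal(64)},
    {"face": rng.standard_normal(128),                "bg": rng.standard_normal(64)},
]
edges = [(i, j) for i, j in combinations(range(len(clips)), 2)
         if should_link(clips[i], clips[j])]
print(edges)   # links the two "alice" clips, filmed in different scenes
```

Requiring LOW background similarity is the clever part: it guarantees every linked training pair shares identity but nothing else, so the model cannot cheat by memorizing the scene.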
Infrastructure at Scale
Avatar V runs on 5,000+ GPUs across multiple cloud providers. Two key infrastructure pieces:
HELIOS
A unified platform that makes GPUs from 5+ providers and 10+ regions act as a single pool. Key ideas:
- Cell-based architecture: GPUs organized into standardized isolated groups called "cells." Problems in one cell don't spread to others.
- Priority-aware scheduling: User-facing video generation gets highest priority. Training gets large stable blocks. Data processing fills the gaps.
- Improved GPU utilization by 15% and reduced wasted GPU time by ~20%.
Custom Data Processing Engine
They outgrew Ray - a popular open-source framework for distributed computing in Python that works well at moderate scale, but whose centralized coordination (the Global Control Store) becomes a bottleneck at 2,000+ nodes - and built a replacement using a different coordination model.
How Good Is It?
Automated Metrics
Compared against Kling O3 Pro, Veo 3.1, OmniHuman 1.5, and Seedance 2.0:
| Metric | What It Measures | Avatar V | Best Competitor |
|---|---|---|---|
| SyncNet Confidence | Lip-audio sync quality (SyncNet measures how well lip movements match audio; higher = better) | 8.97 | 8.86 (Seedance) |
| Face Similarity | Identity preservation | 0.840 | 0.838 (Kling) |
| Q-Align | Visual quality (scored by a vision-language model calibrated to human opinion) | 4.85 | 4.95 (Veo 3.1*) |
*Veo 3.1 wins on visual quality but severely sacrifices identity (Face Sim = 0.714). Over-sharpening inflates its quality score.
Human Evaluation
Avatar V scored highest on all 6 dimensions rated by trained human annotators (5-point scale), including:
- Identity: 4.98/5 (near perfect)
- Lip Sync: 4.69/5
- Motion Naturalness: 4.48/5
- Visual Quality: 4.78/5
The Turing Test
In a "is it real?" test, human annotators correctly identified the real video 77.8% of the time. But in 61% of test cases, at least one of three annotators was fooled by the AI-generated video.
Why This Paper Matters
For Video Production Teams
Avatar V represents a shift from "generic AI video" to "personalized AI video at scale." Previous systems could generate videos of a generic person talking, but couldn't faithfully reproduce a specific person's talking rhythm, micro-expressions, and gestural tendencies. For companies creating personalized video content — training videos, marketing, customer support, localization — this means AI-generated avatars that are actually recognizable as the real person, not just visually similar.
For the Research Community
The paper introduces several techniques with broad applicability beyond avatars:
- Sparse Reference Attention solves the quadratic scaling problem for conditioning on long reference contexts — relevant for any video generation system that conditions on reference material
- The motion representation stream demonstrates that identity and motion can be disentangled and transferred separately, opening the door to motion style transfer across different identities
- The five-stage progressive training pipeline provides a practical template for training complex generative systems — starting broad (text-to-video), then specializing (lip sync, identity, speed, quality) in stages rather than trying to learn everything at once
The Bigger Picture
Avatar V points toward a future where video communication is no longer bottlenecked by the physical availability of the speaker. A CEO could record a 30-second reference video and generate personalized messages to thousands of employees in their own talking style. Educational content could be delivered by an instructor's avatar speaking any of 50+ languages while preserving their teaching mannerisms. The technology raises important questions about consent, deepfakes, and authenticity — but the production-quality bar it sets (1080p, unlimited duration, state-of-the-art fidelity) means these conversations are no longer hypothetical.