Avatar V: The Paper, Explained

A beginner-friendly guide to HeyGen's Avatar V technical report. Every AI term is defined. Every concept is grounded in analogy.

Paper by HeyGen Research (2026) · Explainer published April 13, 2026

Avatar V generates personalized talking videos from just 10 seconds of reference, preserving identity, motion style, and voice across new scripts, languages, and unlimited duration.

The Big Picture

Avatar V is a system that takes a short video of a real person and generates new, high-quality talking-head videos of that person saying anything you want. The generated person doesn't just look like the original - they move, talk, and gesture like them too.

Imagine you film yourself for 30 seconds. Avatar V studies that clip and can then produce a video of "you" giving a presentation in a completely different setting - preserving your facial structure, skin texture, the way you move your mouth when you talk, your hand gestures, and even how you blink.

The Three Problems It Solves

Before Avatar V, existing systems had three big weaknesses:

  1. Shallow identity: They used a single photo as reference. One photo can't capture how you look from different angles, in different lighting, or with different expressions. So generated videos would "drift" - the person would start looking less like you over time.
  2. No personality: They could copy your face but not your behavior. Everyone's generated videos looked the same in terms of motion - generic head bobs and lip movements.
  3. Blurry faces: The AI spreads its learning effort evenly across the entire video frame. But the face (especially lips, teeth, and eyes) is what humans actually care about, and it's a tiny portion of the frame. So faces came out blurry or wrong.

Avatar V's key innovation is using a video (not a single image) as the identity reference, and letting the AI look at the full reference video at every step of generation. More reference = richer identity information.

See It In Action

Before diving into the technical details, see what Avatar V actually produces. These demos are from the official research page.

Reference Video vs Generated Output

Given a short reference video of someone (left), Avatar V generates a new video of that person in a different scene, preserving their identity and talking style (right).

🔊 Turn on sound to hear the voice cloning — Avatar V reproduces the speaker's vocal tone, speech rhythm, and accent from just ~10 seconds of audio (Voice Cloning section)
Reference Video (input)
Avatar V Output

Comparison with Other Models

Avatar V is evaluated against Kling O3 Pro, Veo 3.1, OmniHuman 1.5, and Seedance 2.0. These grid comparisons show all models generating from the same inputs.

Side-by-side comparison: Avatar V vs 4 competing systems on the same inputs (source)

Can You Tell Real from AI? (Turing Test)

In the paper's Turing test, human annotators were shown a pair of videos — one real, one generated — and asked to pick the real one. Click the video you think is real:

Watch both videos first, then click the button below the one you think is real.

Video A
Video B

In the paper's evaluation, 61% of test cases fooled at least one trained annotator. More Turing test pairs are on the project page.

Now that you've seen what Avatar V can do, let's understand how it works. The sections below explain every concept the paper assumes you already know, then walk through each component of the system.

Background Concepts You Need

The paper assumes you know these AI concepts. Let's build them up from scratch.

Diffusion Models

The diffusion process: noise is progressively removed to reveal clean data

A diffusion model is the engine behind most modern image and video generation (DALL-E, Stable Diffusion, Sora, etc.). It generates images and video by learning to reverse a noise-adding process: during training, noise is added to real data; during inference, the model starts from pure noise and progressively removes it to create new data.

Imagine you have a beautiful painting. You slowly sprinkle sand over it - grain by grain - until it's completely buried and all you see is a pile of sand. A diffusion model learns to do the reverse: given a pile of sand (random noise), it learns to carefully remove grains to reveal a painting underneath. During training, you show it millions of examples of "painting + sand at various stages" so it learns what to remove at each step.

Two phases:

  • Training: noise is added to real data, and the model learns to predict exactly what was added at each step.
  • Generation: starting from pure noise, the model repeatedly predicts and removes noise until a clean new image or video emerges.

The number of noise-removal passes is called the denoising step count: the number of times the model looks at the noisy image and removes a bit of noise. More steps = better quality, but slower; typical ranges run from 20 to 1000 steps. Avatar V uses 24 steps after optimization (down from hundreds).

Deep Dive: What is "noise" mathematically?

"Noise" here means Gaussian noise - random values drawn from a bell curve (normal distribution). Each pixel gets a random value added to it. At step 0, the image is clean. At the final step, it's completely random static - no trace of the original image remains.

The model is trained to predict the noise that was added at each step. During generation, it predicts the noise in the current noisy image and subtracts it, getting slightly closer to a clean image each time.
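The mechanics above can be sketched in a few lines. This is a toy illustration: a 1-D signal stands in for an image, and the true noise stands in for a trained network's prediction, so the "denoising" recovers the signal exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1-D signal standing in for pixel data.
clean = np.sin(np.linspace(0, 2 * np.pi, 64))

def add_noise(x0, alpha_bar):
    """Forward process: blend clean data with Gaussian noise.
    alpha_bar near 1 = mostly clean; near 0 = mostly static."""
    eps = rng.standard_normal(x0.shape)
    noisy = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    return noisy, eps  # the model is trained to predict eps from noisy

noisy, eps = add_noise(clean, alpha_bar=0.25)

# A real model would be a neural net; here we "cheat" with the true noise
# just to show how subtracting the predicted noise recovers the data.
predicted_eps = eps  # stand-in for model(noisy, t)
recovered = (noisy - np.sqrt(1 - 0.25) * predicted_eps) / np.sqrt(0.25)

print(np.allclose(recovered, clean))  # True
```

In a real diffusion model the prediction is imperfect, which is exactly why many small steps are taken instead of one big subtraction.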

Transformers & Attention

A Transformer is the architecture (blueprint) that powers ChatGPT, DALL-E, and now Avatar V. It has dominated AI since 2017, and its superpower is the attention mechanism: a way for the model to decide which parts of its input are relevant to the part it is currently processing. Each element computes a relevance score with every other element, then focuses on the most relevant ones.

You're reading a book, and you come across the word "she." Your brain automatically looks back to figure out who "she" refers to - maybe a character mentioned two paragraphs ago. Attention is the AI version of this: for every piece of data it processes, the model looks at all other pieces and asks "how relevant is this to what I'm working on right now?"

Two details of the attention mechanism matter for this paper; the deep dives below unpack them.

Deep Dive: Why "quadratic cost" matters

In standard attention, every token looks at every other token. If you have N tokens, that's N x N comparisons - this is quadratic growth. Double the tokens = 4x the computation.

A reference video might have thousands of tokens. If you naively let all reference + generation tokens attend to each other, the cost explodes. This is the problem Avatar V's "Sparse Reference Attention" solves.

Deep Dive: KV Cache

In attention, each token produces three things: a Query (Q), a Key (K), and a Value (V).

  • Query: "I'm looking for information about X"
  • Key: "I contain information about Y"
  • Value: "Here's my actual content"

The attention score is Q matched against K (like a search query hitting search results). High-scoring matches have their V content sent back.

KV Caching: If the reference video never changes between denoising steps, you can compute its Keys and Values once, cache them, and reuse them for all 24 steps. This is a huge speed win.
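Here is a minimal sketch of that caching pattern. The dimensions, random weights, and softmax-attention helper are all invented stand-ins for the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                        # embedding dimension (illustrative)
ref_tokens = rng.standard_normal((100, d))    # reference video tokens
W_k = rng.standard_normal((d, d))             # Key projection
W_v = rng.standard_normal((d, d))             # Value projection

# Compute the reference Keys and Values ONCE...
K_cache = ref_tokens @ W_k
V_cache = ref_tokens @ W_v

def cross_attend(queries, K, V):
    """One attention read: queries search the cached reference."""
    scores = queries @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

# ...then reuse them at every denoising step. Only the generation
# tokens change between steps; the reference never does.
gen = rng.standard_normal((10, d))
for step in range(24):                        # 24 denoising steps, as in the paper
    context = cross_attend(gen, K_cache, V_cache)

print(context.shape)  # (10, 16)
```

The saving compounds: the reference projections are skipped 23 out of 24 times.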

VAE (Variational Autoencoder)

The VAE compresses video frames into a tiny latent space, where the diffusion model actually works

A VAE (Variational Autoencoder) compresses high-resolution images and video into a smaller representation and decompresses them back.

Think of JPEG compression: a 10MB photo becomes a 500KB file, then gets decompressed back to an image that looks almost identical. A VAE does this but with AI: it learns the most efficient way to compress visual data. The compressed version lives in the latent space, a lower-dimensional space where the diffusion model works for efficiency before the VAE decoder converts results back to pixels.

Why this matters: Working with full 1080p video frames directly would be absurdly expensive computationally. Instead, the VAE compresses each frame into a tiny "latent" version, the diffusion model works in this compact space, and then the VAE decompresses the result back to full-size video.
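Back-of-the-envelope arithmetic shows the saving. The 8x spatial downsample and 16 latent channels below are assumptions typical of image/video VAEs, not figures from the paper:

```python
# One 1080p frame: height x width x RGB channels.
frame_pixels = 1080 * 1920 * 3

# Hypothetical latent: 8x smaller in each spatial dimension, 16 channels.
latent_values = (1080 // 8) * (1920 // 8) * 16

print(frame_pixels)                  # 6220800 values per frame
print(latent_values)                 # 518400 values per latent frame
print(frame_pixels / latent_values)  # 12.0x fewer values to denoise
```

And because attention cost grows quadratically with token count, a 12x reduction in values buys far more than a 12x reduction in compute.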

Flow Matching

DDPM takes 1000 curvy steps; flow matching takes 24 straight ones

Flow matching is a modern alternative to the original diffusion training method, DDPM (Denoising Diffusion Probabilistic Models, the 2020 approach that learns to predict the noise added at each step). Instead of predicting noise, a flow-matching model learns to predict a "velocity": the direction and speed to move from noise toward clean data along a straight line. This tends to be more stable and efficient than the older method.

Original diffusion is like navigating a maze from noise to image - the path is curvy and you need many steps. Flow matching draws a straight line from noise to image and teaches the model to follow it. Straighter path = fewer steps needed = faster generation.

The model learns a velocity field: at each point along the noise-to-data journey, it predicts which direction to go and how fast. Avatar V uses "rectified flow matching," which specifically encourages straight-line paths.
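A tiny sketch of rectified-flow generation, with a hand-coded constant velocity standing in for the learned network (in a real model the velocity is predicted, and paths are only approximately straight):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = np.array([3.0, -2.0])       # "clean data" point (illustrative)
x0 = rng.standard_normal(2)      # starting noise

# Rectified flow trains on straight-line interpolants:
#   x_t = (1 - t) * x0 + t * x1,  whose velocity is constant: x1 - x0.
def velocity(x_t, t):
    return x1 - x0               # stand-in for the learned velocity network

# Generation = integrate the velocity field from noise (t=0) to data (t=1).
steps = 24
x = x0.copy()
dt = 1.0 / steps
for i in range(steps):
    x = x + dt * velocity(x, i * dt)

print(np.allclose(x, x1))  # True: a straight path reaches the data exactly
```

The straighter the learned paths, the fewer integration steps generation needs, which is what makes 24 steps viable.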

Embeddings

An embedding is a list of numbers (a vector) that represents something complex (a face, a voice, a word) in a compact way that captures its meaning. Similar things get similar number patterns, which makes embeddings useful both for measuring similarity and as input to AI models.

Imagine describing a person's face using just 512 numbers. Number 1 might relate to face shape, number 2 to skin tone, number 37 to nose width, etc. (The model figures out what each number means on its own.) Two people who look similar will have similar number lists. This is a face embedding.
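"Similar number lists" is usually made precise with cosine similarity. A sketch with toy 4-number vectors and hypothetical names (real face embeddings run to hundreds of numbers):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 = same direction (very similar), near 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "face embeddings"; values are invented for illustration.
alice_clip1 = np.array([0.90, 0.10, 0.40, 0.70])
alice_clip2 = np.array([0.85, 0.15, 0.42, 0.68])  # same person, new scene
bob         = np.array([0.10, 0.90, 0.80, 0.20])  # different person

same_person = cosine_similarity(alice_clip1, alice_clip2)
diff_person = cosine_similarity(alice_clip1, bob)
print(same_person > diff_person)  # True
```

This is also how the paper's Face Similarity metric works in spirit: compare the embedding of a generated face against the reference's.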

The paper uses embeddings in several places, most importantly face embeddings for measuring identity similarity.

How Avatar V Works

Data flows through four stages: Image Engine → VideoRef DiT → Super-Res → VAE Decode

Avatar V has four major components working together:

Inputs: reference video, audio track, and text prompt.

  1. 🎬 Image Engine: generates a scene image preserving your face
  2. 🧠 VideoRef DiT: the core video generator with Sparse Reference Attention
  3. 🔍 Super-Resolution Refiner: upscales to 1080p with identity awareness
  4. 📹 Streaming VAE Decode: converts from latent space back to pixels

Output: your avatar video.

Sparse Reference Attention

Standard quadratic attention vs Avatar V's sparse linear attention

This is Avatar V's most important innovation. The core idea:

Instead of squeezing your identity into a small set of numbers (which loses detail), Avatar V keeps the FULL reference video as tokens and lets the generated video "look at" all of them whenever it needs identity information.

But there's a cost problem. With standard attention, every token attends to every other token, so reference and generation tokens together incur quadratic cost, and a reference video alone contributes thousands of tokens.

Sparse Reference Attention's trick: generation tokens attend to the reference, but reference tokens never attend back to the generation. Those dropped connections carry no useful information anyway, and removing them roughly halves the attention work.

Imagine a class of students (generation tokens) learning from a set of textbooks (reference tokens). In "standard attention," every student reads every textbook AND every textbook somehow reads every student's notes - wasteful. In Sparse Reference Attention, students read the textbooks, but textbooks don't read student notes. Half the work, same learning.
Deep Dive: What "asymmetric" means here

The attention is asymmetric because the two groups of tokens have different attention rules:

  • Reference tokens: Self-attention only (look at each other)
  • Generation tokens: Attend to BOTH generation tokens AND reference tokens

This asymmetry is what makes it "sparse" - not all possible attention connections exist. The missing connections (reference looking at generation) aren't useful anyway, so removing them is free performance.
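The asymmetric rule can be written down as an attention mask. A toy sketch with invented token counts (the real model has thousands of tokens per side):

```python
import numpy as np

n_ref, n_gen = 4, 3          # tiny counts for illustration
n = n_ref + n_gen

# mask[i, j] = True means token i is allowed to attend to token j.
mask = np.zeros((n, n), dtype=bool)
mask[:n_ref, :n_ref] = True  # reference tokens: self-attention only
mask[n_ref:, :] = True       # generation tokens: attend to EVERYTHING

full_links = n * n           # standard attention: all pairs
sparse_links = int(mask.sum())
print(full_links, sparse_links)  # 49 37
```

The 12 missing links are exactly the reference-looks-at-generation connections; at realistic token counts, dropping them is a large fraction of the total work.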

Motion Representation Stream

This component captures how a person moves, not just how they look.

Everyone has a unique "motion fingerprint." Some people barely move their head when talking. Others are very animated. Some smile with their whole face; others just slightly raise one corner of their mouth. The motion representation stream learns these individual patterns.

It serves two roles simultaneously (called "closed-loop"):

  1. As a learning target: "Given this audio, predict how THIS specific person would move"
  2. As a conditioning signal: "Use these predicted motions to guide video generation"

By doing both, the model develops a unified understanding of each person's motion style.

Super-Resolution Refiner

Generic upscaling just makes bigger pixels; Avatar V's refiner uses the reference to reconstruct real detail

The core model generates video at low resolution (for speed). The super-resolution refiner then upscales it to 1080p, generating a higher-resolution version with added detail: like "enhance" in the movies, except it actually works because the AI has learned what realistic detail looks like.

What makes it special: Unlike generic upscaling (which just makes pixels bigger), Avatar V's refiner has access to the same identity reference video. So when it's enhancing the face region, it can look at the reference to know exactly what your teeth, skin pores, and eye details should look like.

It also uses sparse temporal attention: instead of each frame looking at ALL other frames during upscaling (expensive), each frame only looks at nearby frames. Since the base model has already established smooth, consistent motion, the refiner only needs to add local detail, not enforce global consistency.

Voice Cloning Engine

From just ~10 seconds of audio, the voice cloning engine can reproduce someone's voice. It is built on an LLM (Large Language Model) backbone, the same class of architecture behind ChatGPT, and treats speech generation as predicting a sequence of audio tokens: discrete codes representing small chunks of sound, produced by a codec the way text is broken into word-tokens. The model predicts these tokens one by one, just as ChatGPT predicts the next word, but for sound.

How It Learns (Training Pipeline)

Five training stages, each building on the last, from general video understanding to human-preferred quality

Avatar V doesn't learn everything at once. It follows a 5-stage curriculum, like going from elementary school through grad school:

  1. Text-to-Video: "Learn what video is"
  2. Audio-to-Video: "Learn how lips sync to speech"
  3. Personality Fine-Tuning: "Learn to copy someone's identity"
  4. Distillation: "Learn to do it 10x faster"
  5. Human Feedback: "Learn what humans prefer"

Stage 1: Text-to-Video Pre-Training

The model first learns general video understanding from millions of text-video pairs. "A dog runs across a field" → video of a dog running. This teaches:

  • How objects move through space
  • How lighting and physics work
  • Basic scene composition

Training uses progressive scaling: start with tiny, short videos, gradually increase resolution and duration. Like teaching a child to draw stick figures before oil paintings.

Optimizer: Muon for most parameters, AdamW for embeddings. (An optimizer is the algorithm that adjusts the model's weights during training. AdamW is the standard choice for Transformers: it adapts learning rates per parameter and adds weight decay to keep weights from growing too large. Muon is a newer (2025) optimizer that is more efficient than Adam-family methods for large models.)

Stage 2: Audio-to-Video Pre-Training

Now the model learns to synchronize lips with speech. Given a face image + audio track, generate a video where the person speaks those words. This stage adds the audio cross-attention modules that connect sound features to visual generation.

Trained on a huge corpus of talking-head videos covering diverse speakers, languages, and styles.

Stage 3: Personality SFT (Supervised Fine-Tuning)

SFT (Supervised Fine-Tuning) means taking the pre-trained general model and training it further on a specific task with labeled examples, here identity preservation. Like a medical student (pre-trained on general medicine) specializing in cardiology (fine-tuned on heart cases).

The training data is carefully constructed: each example has a target video (what to generate) paired with reference clips of the same person in different scenes. This forces the model to extract identity features that are independent of the background.

This is where Sparse Reference Attention and the motion representation stream are activated.

Human-aware auxiliary losses are added here - extra training signals beyond pixel-level accuracy that specifically target face quality, lip sync, identity similarity, and motion fidelity.

Stage 4: Distillation (Making It 10x Faster)

Distillation compresses the slow, high-quality model into a fast one by training a "student" model to mimic a larger or slower "teacher," learning to produce similar outputs in fewer steps. Like a student learning shortcuts from an experienced teacher.

Phase 1: CFG Distillation

Classifier-Free Guidance (CFG) is a technique where the model normally runs twice per denoising step: once with the conditioning (e.g., "a cat") and once without, with the difference between the two amplified so the output matches the condition more closely. The problem: it doubles (or more) the computation per step. CFG distillation teaches the model to internalize this behavior, needing only ONE pass.
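A sketch of the two-pass CFG computation that the student learns to internalize. The toy model and guidance scale are illustrative:

```python
import numpy as np

def model(x, condition=None):
    """Stand-in for one forward pass of the video model."""
    base = 0.1 * x
    return base + (1.0 if condition is not None else 0.0)

x = np.zeros(4)
scale = 3.0

# Standard CFG: TWO forward passes per denoising step.
cond = model(x, condition="a person reading this script")
uncond = model(x)
guided = uncond + scale * (cond - uncond)  # push harder toward the condition

print(guided)  # the conditioned direction, amplified 3x
```

CFG distillation trains a single-pass student whose output matches `guided` directly, halving (or better) the per-step cost.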

Phase 2: DMD (Distribution Matching Distillation)

This reduces the number of denoising steps. Uses a three-model setup:

  • Student: Learns to generate in fewer steps
  • Fake teacher: Models what the student's outputs look like (trainable)
  • Real teacher: The original slow model (frozen - doesn't change)

The student learns to make its output distribution match the real teacher's, even though it uses far fewer steps.

Combined result: 10x+ faster inference.

Stage 5: RLHF (Learning from Human Preferences)

RLHF (Reinforcement Learning from Human Feedback) means letting humans rate outputs (or compare pairs) and training the model to produce outputs that score higher. This is how ChatGPT was trained to be helpful; here, it is used to make videos look more natural to human eyes.

Two approaches are combined:

  • GRPO (Group Relative Policy Optimization): Generate a group of videos, score them all with reward functions (identity similarity, motion naturalness, visual quality), then improve the model from the relative rankings within the group. More stable than traditional policy-gradient methods.
  • DPO (Direct Preference Optimization): Learn directly from human-annotated preference pairs ("this video is better than that one"), with no separate reward model. Simpler than reward-model RLHF and often equally effective.

KL regularization prevents the model from drifting too far from its pre-RLHF capabilities. KL divergence measures how different the new model is from the old one; without this constraint, the model might "hack" the reward function by producing weird outputs that score high but look terrible.

How It Generates Video (Inference)

Inference is when the trained model actually generates a video (training = learning, inference = doing; when you type a prompt into ChatGPT, the response is produced during inference). Here's how:

Chunk-Based Generation

Overlapping chunks enable unlimited-duration video generation

Avatar V generates video in chunks of ~6.4 seconds each. For longer videos, chunks are stitched together:

  1. First chunk: Uses the reference video directly to establish identity
  2. Subsequent chunks: Use the last frames of the previous chunk as a bridge to maintain continuity
  3. A global appearance anchor from the first chunk keeps identity consistent across all chunks

Like writing a long essay in paragraphs. Each new paragraph starts by referencing the end of the previous one to maintain flow. And you keep a photo of the main character on your desk to stay consistent.
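The chunking loop can be sketched as follows. Everything here is an assumption for illustration: the function names, the 160-frame chunk (~6.4 s at 25 fps), and the 8-frame bridge are not the paper's actual values.

```python
CHUNK_FRAMES = 160    # ~6.4 s at 25 fps (assumed frame rate)
BRIDGE_FRAMES = 8     # frames carried into the next chunk (invented)

def generate_chunk(reference, bridge, anchor):
    """Stand-in for one diffusion run; returns frame ids, continuing
    from the bridge frames so motion stays continuous."""
    start = bridge[-1] + 1 if bridge else 0
    return list(range(start, start + CHUNK_FRAMES))

def generate_video(reference, num_chunks):
    video, bridge, anchor = [], [], None
    for _ in range(num_chunks):
        chunk = generate_chunk(reference, bridge, anchor)
        if anchor is None:
            anchor = chunk[0]            # global appearance anchor: chunk 1
        video.extend(chunk)
        bridge = chunk[-BRIDGE_FRAMES:]  # last frames seed the next chunk
    return video

video = generate_video("reference.mp4", num_chunks=3)
print(len(video))  # 480 frames, ~19.2 s
```

The key structural point survives the simplification: only the first chunk sees the reference directly, while the anchor and bridge keep later chunks consistent.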

Speed Optimizations

The paper describes several clever tricks to make generation fast enough for production.

Data: Fuel for the Model

Avatar V was trained on a massive dataset: 100M+ clips curated from 50M raw videos.

The Data Pipeline

Raw videos go through a multi-stage filtering cascade:

  1. Normalize resolution (640px) and frame rate (25 frames per second)
  2. Reject static or choppy content by measuring how much changes between frames
  3. Detect humans and faces using AI detection models
  4. Score visual quality using an AI model trained to match human quality judgments
  5. Smart clipping: automatically find the best start and end points for each clip
  6. Scene-cut detection and content filtering (reject screencasts, games, static photos)
  7. Categorize across 15 dimensions and deduplicate by finding near-identical clips and removing copies
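The cascade pattern itself is simple: each stage rejects clips that fail a check, so expensive later stages only see survivors. A toy sketch with invented thresholds and field names:

```python
# Hypothetical clip metadata; values are invented for illustration.
clips = [
    {"id": 1, "motion": 0.8, "has_face": True,  "quality": 4.2},
    {"id": 2, "motion": 0.1, "has_face": True,  "quality": 4.9},  # too static
    {"id": 3, "motion": 0.7, "has_face": False, "quality": 4.5},  # no face
    {"id": 4, "motion": 0.6, "has_face": True,  "quality": 2.1},  # low quality
]

stages = [
    lambda c: c["motion"] > 0.3,    # reject static/choppy content
    lambda c: c["has_face"],        # human/face detection
    lambda c: c["quality"] >= 3.5,  # learned quality score
]

survivors = clips
for stage in stages:
    survivors = [c for c in survivors if stage(c)]

print([c["id"] for c in survivors])  # [1]
```

Ordering matters in practice: cheap checks (resolution, motion) go first so that costly model-based scoring runs on as few clips as possible.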

Cross-Clip Identity Connectivity

A critical data innovation: they build a graph connecting video clips of the same person across different scenes. Two clips are linked when the pipeline determines they show the same identity.

This lets the model learn "this is the same person even though the background, lighting, and camera angle are completely different" - essential for identity that doesn't depend on the scene.

Infrastructure at Scale

Avatar V runs on 5,000+ GPUs across multiple cloud providers. Two key infrastructure pieces:

HELIOS

A unified platform that makes GPUs from 5+ providers and 10+ regions act as a single pool.

Custom Data Processing Engine

They outgrew Ray, a popular open-source framework for distributed computing in Python, at 2,000+ nodes: Ray's centralized coordination (the Global Control Store) becomes a bottleneck at very large scale. So they built a replacement using a different coordination model:

Instead of the boss calling each worker and saying "do task X" (imperative/command model), the boss posts a bulletin board saying "I need X done" and workers independently check the board and do what's needed (declarative model). If a worker crashes and restarts, they just check the board again. No messages get lost.
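The bulletin-board idea can be sketched as a shared task table that workers reconcile against. All names here are illustrative, not HeyGen's actual engine, and a real system would add persistence and claim/lease logic:

```python
# The "bulletin board": desired work, posted declaratively.
board = {f"task-{i}": "pending" for i in range(5)}

def worker(name, crash_on=None):
    """A worker scans the board and does whatever is still pending."""
    for task, state in list(board.items()):
        if state != "pending":
            continue              # someone already did it
        if task == crash_on:
            return                # simulate a crash: board is untouched
        board[task] = f"done by {name}"

worker("w1", crash_on="task-3")   # w1 dies before finishing task-3
worker("w2")                      # w2 just re-reads the board and continues

print(all(state.startswith("done") for state in board.values()))  # True
```

No command message was lost when w1 crashed, because there were no command messages: the board itself is the source of truth.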

How Good Is It?

Automated Metrics

Compared against Kling O3 Pro, Veo 3.1, OmniHuman 1.5, and Seedance 2.0:

Metric | What It Measures | Avatar V | Best Competitor
---|---|---|---
SyncNet Confidence | Lip-audio sync quality | 8.97 | 8.86 (Seedance)
Face Similarity | Identity preservation | 0.840 | 0.838 (Kling)
Q-Align | Visual quality | 4.85 | 4.95 (Veo 3.1*)

SyncNet is a standard benchmark model in talking-head research that measures how well lip movements match audio (higher confidence = better sync). Q-Align is a vision-language model that scores visual quality on a scale calibrated to human opinion.

*Veo 3.1 wins on visual quality but severely sacrifices identity (Face Sim = 0.714). Over-sharpening inflates its quality score.

Human Evaluation

Avatar V scored highest on all 6 dimensions rated by trained human annotators on a 5-point scale.

The Turing Test

In a "is it real?" test, human annotators correctly identified the real video 77.8% of the time. But in 61% of test cases, at least one of three annotators was fooled by the AI-generated video.

Quick Check: Why does Avatar V use a video reference instead of a single image?

Final Comprehension Quiz

What does "Sparse Reference Attention" solve?
Why does the training pipeline have 5 stages instead of training everything at once?
What is "distillation" in the context of this paper?
What does RLHF do for Avatar V?
Why did HeyGen replace Ray with a custom data processing engine?

Why This Paper Matters

For Video Production Teams

Avatar V represents a shift from "generic AI video" to "personalized AI video at scale." Previous systems could generate videos of a generic person talking, but couldn't faithfully reproduce a specific person's talking rhythm, micro-expressions, and gestural tendencies. For companies creating personalized video content — training videos, marketing, customer support, localization — this means AI-generated avatars that are actually recognizable as the real person, not just visually similar.

For the Research Community

The paper introduces several techniques with broad applicability beyond avatars.

The Bigger Picture

Avatar V points toward a future where video communication is no longer bottlenecked by the physical availability of the speaker. A CEO could record a 30-second reference video and generate personalized messages to thousands of employees in their own talking style. Educational content could be delivered by an instructor's avatar speaking any of 50+ languages while preserving their teaching mannerisms. The technology raises important questions about consent, deepfakes, and authenticity — but the production-quality bar it sets (1080p, unlimited duration, state-of-the-art fidelity) means these conversations are no longer hypothetical.