Avatar V: The Paper, Explained
A beginner-friendly guide to HeyGen's Avatar V technical report. Every AI term is defined. Every concept is grounded in analogy.
The Big Picture
Avatar V is a system that takes a short video of a real person and generates new, high-quality talking-head videos of that person saying anything you want. The generated person doesn't just look like the original - they move, talk, and gesture like them too.
The Three Problems It Solves
Before Avatar V, existing systems had three big weaknesses:
- Shallow identity: They used a single photo as reference. One photo can't capture how you look from different angles, in different lighting, or with different expressions. So generated videos would "drift" - the person would start looking less like you over time.
- No personality: They could copy your face but not your behavior. Everyone's generated videos looked the same in terms of motion - generic head bobs and lip movements.
- Blurry faces: The AI spreads its learning effort evenly across the entire video frame. But the face (especially lips, teeth, and eyes) is what humans actually care about, and it's a tiny portion of the frame. So faces came out blurry or wrong.
See It In Action
Before diving into the technical details, see what Avatar V actually produces. These demos are from the official research page.
Reference Video vs Generated Output
Given a short reference video of someone (left), Avatar V generates a new video of that person in a different scene, preserving their identity and talking style (right).
Comparison with Other Models
Avatar V is evaluated against Kling O3 Pro, Veo 3.1, OmniHuman 1.5, and Seedance 2.0. These grid comparisons show all models generating from the same inputs.
Can You Tell Real from AI? (Turing Test)
In the paper's Turing test, human annotators were shown a pair of videos — one real, one generated — and asked to pick the real one.
In the paper's evaluation, 61% of test cases fooled at least one trained annotator. More Turing test pairs are on the project page.
Background Concepts You Need
The paper assumes you know these AI concepts. Let's build them up from scratch.
Diffusion Models
A diffusion model is the engine behind most modern image and video generation (DALL-E, Stable Diffusion, Sora, etc.). It generates images and video by learning to reverse a noise-adding process: during training, noise is added to real data; at inference, the model starts from pure noise and progressively removes it to create new data.
Two phases:
- Forward process (training): Take real data, gradually add random noise until it becomes pure static.
- Reverse process (generation): Start from pure noise, remove noise step-by-step. Each step, the model predicts "what noise is here?" and subtracts it.
The number of noise-removal passes is called the number of denoising steps - how many times the model looks at the noisy image and removes a bit of noise, typically anywhere from 20 to 1000. More steps = better quality, but slower. Avatar V uses 24 steps after optimization (down from hundreds).
Deep Dive: What is "noise" mathematically?
"Noise" here means Gaussian noise - random values drawn from a bell curve (normal distribution). Each pixel gets a random value added to it. At step 0, the image is clean. At the final step, it's completely random static - no trace of the original image remains.
The model is trained to predict the noise that was added at each step. During generation, it predicts the noise in the current noisy image and subtracts it, getting slightly closer to a clean image each time.
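A minimal numpy sketch of the forward process, using a simplified linear blending schedule (real diffusion schedules are more elaborate). The point it demonstrates: correlation with the original data fades as noise accumulates, until no trace remains.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)   # toy "clean image" (1-D for simplicity)

def add_noise(x0, t, T, rng):
    """Blend clean data with Gaussian noise; t=0 is clean, t=T is pure static."""
    alpha = 1.0 - t / T                      # how much of the original survives
    eps = rng.standard_normal(x0.shape)      # fresh Gaussian noise
    return alpha * x0 + (1.0 - alpha**2) ** 0.5 * eps

T = 10
corr_early = abs(np.corrcoef(x0, add_noise(x0, 1, T, rng))[0, 1])
corr_late = abs(np.corrcoef(x0, add_noise(x0, 9, T, rng))[0, 1])
print(corr_early, corr_late)   # the original signal fades as t grows
```

A trained diffusion model runs this in reverse: at each step it predicts the noise component and subtracts it.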
Transformers & Attention
A Transformer is the architecture (blueprint) that has dominated AI since 2017 - it powers ChatGPT, DALL-E, and now Avatar V. Its superpower is the attention mechanism: a way for the model to decide which parts of the input are relevant to the part it's currently processing. Each element computes a relevance score with every other element, then focuses on the most relevant ones.
Key terms the paper uses:
- DiT (Diffusion Transformer) - Avatar V's core architecture: a Transformer designed specifically for diffusion. Instead of a U-Net (the older approach), it uses Transformer blocks to process the noisy image/video, which is more scalable and powerful at large sizes.
- Self-attention - Elements in the same sequence look at each other to gather context: "How does this video frame relate to the other frames in this video?"
- Cross-attention - Elements from one input look at elements from a different input: for example, video frames "looking at" audio features to synchronize lip movements with speech sounds.
- Tokens - The basic units the model works with. Text gets split into word-piece tokens, images into patch tokens, and video into space-time patch tokens. Everything becomes a sequence of tokens.
Deep Dive: Why "quadratic cost" matters
In standard attention, every token looks at every other token. If you have N tokens, that's N x N comparisons - this is quadratic growth. Double the tokens = 4x the computation.
A reference video might have thousands of tokens. If you naively let all reference + generation tokens attend to each other, the cost explodes. This is the problem Avatar V's "Sparse Reference Attention" solves.
Deep Dive: KV Cache
In attention, each token produces three things: a Query (Q), a Key (K), and a Value (V).
- Query: "I'm looking for information about X"
- Key: "I contain information about Y"
- Value: "Here's my actual content"
The attention score is Q matched against K (like a search query hitting search results). High-scoring matches have their V content sent back.
KV Caching: If the reference video never changes between denoising steps, you can compute its Keys and Values once, cache them, and reuse them for all 24 steps. This is a huge speed win.
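A toy numpy sketch of KV caching (the token counts and projection matrices are made up, and real models use many heads and batches): the reference's Keys and Values are projected once, then reused across all 24 denoising steps.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no batching)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 64
ref_tokens = rng.standard_normal((200, d))                    # reference tokens: fixed
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy projections

# The reference never changes between denoising steps, so project it ONCE.
ref_K, ref_V = ref_tokens @ Wk, ref_tokens @ Wv               # <-- the KV cache

for step in range(24):                                        # 24 steps, per the paper
    gen_tokens = rng.standard_normal((150, d))                # changes every step
    out = attention(gen_tokens @ Wq, ref_K, ref_V)            # reuse cached K/V
print(out.shape)
```

Without the cache, the two projection matmuls over the reference would be repeated 24 times for zero benefit.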
VAE (Variational Autoencoder)
A VAE (Variational Autoencoder) compresses high-resolution images/video into a smaller "latent" representation and decompresses it back - like JPEG compression, but learned by AI and much more powerful.
Why this matters: Working with full 1080p video frames directly would be absurdly expensive computationally. Instead, the VAE compresses each frame into a tiny "latent" version, the diffusion model works in this compact space, and then the VAE decompresses the result back to full-size video.
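Some back-of-envelope arithmetic makes the saving concrete. The compression factors below (8x per spatial axis, 4x temporal, 16 latent channels) are typical of modern video VAEs, not figures the Avatar V report confirms.

```python
# Why diffusion runs in latent space: count the values the model must process.
frames, height, width, channels = 24, 1080, 1920, 3
pixel_values = frames * height * width * channels

# Hypothetical compression: 8x per spatial axis, 4x temporal, 16 latent channels.
latent_values = (frames // 4) * (height // 8) * (width // 8) * 16

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values")
print(f"compression:  {pixel_values // latent_values}x")
```

Every downstream cost (attention, memory, denoising steps) shrinks with this ratio, which is why the diffusion model never touches raw pixels.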
Flow Matching
Flow matching is a modern alternative to DDPM (Denoising Diffusion Probabilistic Models), the original 2020 method for training diffusion models, which uses a specific noise schedule and learns to predict the noise added at each step. Flow matching is often more stable and efficient.
Instead of predicting noise, the model learns a velocity field: at each point along the noise-to-data journey, it predicts which direction to go and how fast. Avatar V uses "rectified flow matching," which specifically encourages straight-line paths.
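The straight-line idea fits in a few lines of numpy. This is a generic rectified-flow sketch, not Avatar V's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(512)   # a clean data sample
x1 = rng.standard_normal(512)   # a pure-noise sample

# Rectified flow puts each training pair on a straight line:
#   x_t = (1 - t) * x0 + t * x1
# and the model's regression target is the constant velocity along that line:
#   v = x1 - x0
t = 0.3
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

# Walking back along the velocity from x_t recovers the clean sample exactly:
x_back = x_t - t * v_target
print(np.allclose(x_back, x0))
```

Because the target path is straight, few integration steps are needed at inference, which is what later makes aggressive step-count distillation possible.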
Embeddings
An embedding is a list of numbers (a vector) that represents something complex - a face, a voice, a word - in a compact way that captures its meaning. Similar things have similar number patterns, which makes embeddings useful both for measuring similarity and as input to AI models.
The paper mentions several types:
- Identity embedding: Numbers capturing someone's facial appearance
- Expression embedding: Numbers capturing facial expressions at a moment
- Speaker embedding: Numbers capturing someone's voice characteristics
- Text embedding: Numbers representing the meaning of a text prompt
- ArcFace embedding - A face embedding from a well-known 2019 face recognition model, used as the standard way to measure identity similarity: if the cosine similarity between two ArcFace embeddings is high, the faces look alike
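A toy numpy illustration of how embedding similarity works, with random vectors standing in for real ArcFace outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: 1 = identical direction, near 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
person = rng.standard_normal(512)                       # toy 512-d "face embedding"
same_person = person + 0.1 * rng.standard_normal(512)   # same face, new photo
stranger = rng.standard_normal(512)                     # unrelated face

print(cosine_similarity(person, same_person))   # high: same identity
print(cosine_similarity(person, stranger))      # near zero: different identity
```

This is the mechanism behind the paper's "Face Similarity" metric: embed both faces, compare the vectors.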
How Avatar V Works
Avatar V has four major components working together:
Sparse Reference Attention
This is Avatar V's most important innovation. The core idea: instead of conditioning on a single reference photo, feed the entire reference video into the model as tokens the generation can attend to, so identity is captured across angles, lighting, and expressions.
But there's a cost problem. With standard attention:
- Reference video tokens: let's say 5,000
- Generation video tokens: let's say 5,000
- Standard attention: every token looks at every other = 10,000 x 10,000 = 100 million comparisons
Sparse Reference Attention's trick:
- Generation tokens CAN look at reference tokens (they need identity info)
- Reference tokens only look at OTHER reference tokens (they don't need anything from the generation)
- This makes the cost linear in reference length instead of quadratic
Deep Dive: What "asymmetric" means here
The attention is asymmetric because the two groups of tokens have different attention rules:
- Reference tokens: Self-attention only (look at each other)
- Generation tokens: Attend to BOTH generation tokens AND reference tokens
This asymmetry is what makes it "sparse" - not all possible attention connections exist. The missing connections (reference looking at generation) aren't useful anyway, so removing them is free performance.
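The asymmetric mask can be sketched directly. Token counts here are scaled down from the 5,000s used in the example above, and a real implementation would skip the masked computation rather than build a dense mask:

```python
import numpy as np

n_ref, n_gen = 100, 100          # token counts (scaled-down illustration)
n = n_ref + n_gen                # reference tokens first, generation tokens after

# mask[i, j] = True means token i may attend to token j.
mask = np.zeros((n, n), dtype=bool)
mask[:n_ref, :n_ref] = True      # reference -> reference only
mask[n_ref:, :] = True           # generation -> reference AND generation

full = n * n                     # standard attention: every pair
sparse = int(mask.sum())         # asymmetric: ref->gen links removed
print(full, sparse)              # the saving equals n_ref * n_gen links
```

The removed block (reference attending to generation) is exactly the set of connections the text calls "not useful anyway."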
Motion Representation Stream
This component captures how a person moves, not just how they look.
It serves two roles simultaneously (called "closed-loop"):
- As a learning target: "Given this audio, predict how THIS specific person would move"
- As a conditioning signal: "Use these predicted motions to guide video generation"
By doing both, the model develops a unified understanding of each person's motion style.
Super-Resolution Refiner
The core model generates video at low resolution (for speed). A super-resolution refiner then upscales it to 1080p - generating a higher-resolution version with added detail, like "enhance" in the movies, except it actually works because the AI has learned what realistic detail looks like.
What makes it special: Unlike generic upscaling (which just makes pixels bigger), Avatar V's refiner has access to the same identity reference video. So when it's enhancing the face region, it can look at the reference to know exactly what your teeth, skin pores, and eye details should look like.
It also uses sparse temporal attention: instead of each frame looking at ALL other frames during upscaling (expensive), each frame only looks at nearby frames. Since the base model already established smooth, consistent motion, the refiner only needs to add local detail, not global consistency.
Voice Cloning Engine
From just ~10 seconds of audio, the voice cloning engine can reproduce someone's voice. It's built on an LLM (Large Language Model) backbone - the same kind of architecture as ChatGPT, here applied to audio - and treats speech generation as predicting a sequence of audio tokens: discrete codec codes representing small chunks of sound, predicted one by one the way ChatGPT predicts the next word.
How It Learns (Training Pipeline)
Avatar V doesn't learn everything at once. It follows a 5-stage curriculum, like going from elementary school through grad school:
Stage 1: Text-to-Video Pre-Training
The model first learns general video understanding from millions of text-video pairs: "A dog runs across a field" → video of a dog running. This teaches:
- How objects move through space
- How lighting and physics work
- Basic scene composition
Training uses progressive scaling: start with tiny, short videos, gradually increase resolution and duration. Like teaching a child to draw stick figures before oil paintings.
Optimizer: Muon for most parameters, AdamW for embeddings. (An optimizer is the algorithm that adjusts the model's numbers - its weights - during training. Muon is a newer optimizer, from 2025, that is more efficient than the widely-used Adam for large models; AdamW is the standard Transformer choice - Adam adapts learning rates per parameter, and the "W" adds weight decay, a regularization technique that keeps weights from growing too large.)
Stage 2: Audio-to-Video Pre-Training
Now the model learns to synchronize lips with speech. Given a face image + audio track, generate a video where the person speaks those words. This stage adds the audio cross-attention modules that connect sound features to visual generation.
Trained on a huge corpus of talking-head videos covering diverse speakers, languages, and styles.
Stage 3: Personality SFT (Supervised Fine-Tuning)
SFT (Supervised Fine-Tuning) = taking a pre-trained model and training it further on a specific task with labeled examples - like a medical student (pre-trained on general medicine) specializing in cardiology (fine-tuned on heart cases). Here, the general model is specialized for identity preservation.
The training data is carefully constructed: each example has a target video (what to generate) paired with reference clips of the same person in different scenes. This forces the model to extract identity features that are independent of the background.
This is where Sparse Reference Attention and the motion representation stream are activated.
Human-aware auxiliary losses are added here - extra training signals beyond pixel-level accuracy that specifically target face quality, lip sync, identity similarity, and motion fidelity.
Stage 4: Distillation (Making It 10x Faster)
Distillation compresses the slow, high-quality model into a fast one: a smaller or faster "student" model is trained to mimic the larger, slower "teacher," learning to produce similar outputs in far fewer steps - like a student learning shortcuts from an experienced teacher.
Phase 1: CFG Distillation
Classifier-Free Guidance (CFG) normally runs the model twice per step - once with the conditioning (e.g., "a cat") and once without - and amplifies the difference between the two so the output matches the condition more closely. The problem: this doubles (or more) the computation per step. CFG distillation teaches the model to internalize the guidance, needing only ONE pass.
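A sketch of vanilla CFG with a stand-in model, to make the two-passes-per-step cost concrete. The toy model and guidance scale are purely illustrative:

```python
import numpy as np

def cfg_prediction(model, x, cond, scale=5.0):
    """Vanilla CFG: TWO model calls per denoising step."""
    eps_cond = model(x, cond)      # pass 1: with the condition
    eps_uncond = model(x, None)    # pass 2: without it
    # push the output further in the direction the condition pulls
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy stand-in model: conditioning shifts the prediction by a constant.
def toy_model(x, cond):
    return 0.5 * x + (0.1 if cond is not None else 0.0)

x = np.random.default_rng(0).standard_normal(8)
guided = cfg_prediction(toy_model, x, cond="a talking head")
print(guided)
# A CFG-distilled student is trained so that ONE student call reproduces
# `guided` directly, removing the second pass from every step.
```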
Phase 2: DMD (Distribution Matching Distillation)
This reduces the number of denoising steps. Uses a three-model setup:
- Student: Learns to generate in fewer steps
- Fake teacher: Models what the student's outputs look like (trainable)
- Real teacher: The original slow model (frozen - doesn't change)
The student learns to make its output distribution match the real teacher's, even though it uses far fewer steps.
Combined result: 10x+ faster inference.
Stage 5: RLHF (Learning from Human Preferences)
RLHF (Reinforcement Learning from Human Feedback) = letting humans judge the outputs and training the model to score higher. Humans rate outputs (or compare pairs), and the model learns to produce outputs they prefer - the same recipe used to train ChatGPT to be helpful. Here, it's used to make videos look more natural to human eyes.
Two approaches are combined:
- GRPO (Group Relative Policy Optimization): Generate a group of videos, score them all with reward functions (identity similarity, motion naturalness, visual quality), then improve the model from the relative rankings within the group - more stable than traditional policy gradient methods.
- DPO (Direct Preference Optimization): Learn directly from human-annotated preference pairs ("this video is better than that one") without training a separate reward model - simpler, and often equally effective.
KL regularization prevents the model from drifting too far from its pre-RLHF capabilities. It is a mathematical constraint on how different the new model may be from the old one (measured by KL divergence); without it, the model might "hack" the reward function by producing weird outputs that score high but look terrible.
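GRPO's group-relative scoring fits in a few lines. The reward values below are made up for illustration:

```python
import numpy as np

# GRPO's core trick (sketch): score a GROUP of generations for the same
# input, then learn from rankings relative to the group average.
rewards = np.array([0.62, 0.71, 0.55, 0.80])   # e.g. combined identity/motion/quality scores

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# positive advantage -> better than the group average -> reinforced;
# negative advantage -> worse than average -> discouraged
print(advantages)
```

Because the baseline is the group's own mean, no separate value model is needed, which is part of why the method is stable in practice.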
How It Generates Video (Inference)
Inference is when the trained model actually generates a video. (Training = learning; inference = doing. When you type a prompt into ChatGPT, the response is generated during inference.) Here's how:
Chunk-Based Generation
Avatar V generates video in chunks of ~6.4 seconds each. For longer videos, chunks are stitched together:
- First chunk: Uses the reference video directly to establish identity
- Subsequent chunks: Use the last frames of the previous chunk as a bridge to maintain continuity
- A global appearance anchor from the first chunk keeps identity consistent across all chunks
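The chunking logic above can be sketched as a loop. Everything here - the function names, the list-of-strings "frame" representation - is an illustrative stand-in, not the paper's actual API:

```python
def generate_chunk(reference, audio, bridge=None, anchor=None):
    """Stand-in generator: a chunk is just a list of labeled frames."""
    lead = bridge if bridge is not None else [f"start<{reference}>"]
    return lead + [f"frame<{audio}|anchor={anchor}>"] * 3

def generate_long_video(reference, audio_chunks):
    chunks, anchor, bridge = [], None, None
    for audio in audio_chunks:
        chunk = generate_chunk(reference, audio, bridge=bridge, anchor=anchor)
        if anchor is None:
            anchor = "appearance-of-first-chunk"  # global identity anchor
        bridge = chunk[-1:]                       # tail frames seed the next chunk
        chunks.append(chunk)
    return chunks

video = generate_long_video("ref.mp4", ["audio-0s-6.4s", "audio-6.4s-12.8s"])
# each later chunk starts from the previous chunk's last frame
print(video[1][0] == video[0][-1])
```

The two carried-over pieces play different roles: the bridge keeps motion continuous across the seam, while the anchor keeps identity from drifting over many chunks.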
Speed Optimizations
The paper describes several clever tricks to make generation fast enough for production:
- Context caching: The reference video never changes between denoising steps, so compute it once and reuse it for all 24 steps
- Sequence parallelism: Spread the long token sequence across 8 GPUs using Ulysses Sequence Parallelism - each GPU processes a portion of the sequence, with all-to-all communication when attention needs to see tokens held by other GPUs (the name nods to James Joyce's famously long novel)
- AI-written GPU code: They used an LLM to write optimized low-level GPU programs that fuse many small operations into single large ones, reducing overhead by 3x
- Overlapped communication: GPUs transfer data to each other at the same time as they compute, hiding the communication cost
- GPU clock locking: In distributed inference, the slowest GPU determines speed. They lock all GPUs to a stable frequency to eliminate variance.
Data: Fuel for the Model
Avatar V was trained on a massive dataset: 100M+ clips curated from 50M raw videos.
The Data Pipeline
Raw videos go through a multi-stage filtering cascade.
Cross-Clip Identity Connectivity
A critical data innovation: they build a graph connecting video clips of the same person across different scenes. Two clips are linked if:
- Same person (high face similarity)
- Different scene (low background similarity)
- Long enough to capture motion patterns
This lets the model learn "this is the same person even though the background, lighting, and camera angle are completely different" - essential for identity that doesn't depend on the scene.
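A toy sketch of the linking rule. The embeddings are random stand-ins and the thresholds are illustrative, not values from the paper:

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_link(a, b, face_thresh=0.7, bg_thresh=0.5):
    """Link two clips: same face, different scene."""
    same_person = cosine(a["face"], b["face"]) > face_thresh
    different_scene = cosine(a["bg"], b["bg"]) < bg_thresh
    return same_person and different_scene

rng = np.random.default_rng(0)
alice = rng.standard_normal(128)                 # toy face embedding
clips = [
    {"face": alice + 0.05 * rng.standard_normal(128), "bg": rng.standard_normal(64)},
    {"face": alice + 0.05 * rng.standard_normal(128), "bg": rng.standard_normal(64)},
    {"face": rng.standard_normal(128),                "bg": rng.standard_normal(64)},
]
edges = [(i, j) for i, j in combinations(range(len(clips)), 2)
         if should_link(clips[i], clips[j])]
print(edges)   # links the two "alice" clips, filmed in different scenes
```

Requiring LOW background similarity is the clever part: it guarantees every linked training pair shares identity but nothing else, so the model cannot cheat by memorizing the scene.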
Infrastructure at Scale
Avatar V runs on 5,000+ GPUs across multiple cloud providers. Two key infrastructure pieces:
HELIOS
A unified platform that makes GPUs from 5+ providers and 10+ regions act as a single pool. Key ideas:
- Cell-based architecture: GPUs organized into standardized isolated groups called "cells." Problems in one cell don't spread to others.
- Priority-aware scheduling: User-facing video generation gets highest priority. Training gets large stable blocks. Data processing fills the gaps.
- Improved GPU utilization by 15% and reduced wasted GPU time by ~20%.
Custom Data Processing Engine
They outgrew Ray - a popular open-source framework for distributed computing in Python that works well at moderate scale, but whose centralized coordination (the Global Control Store) becomes a bottleneck at 2,000+ nodes - and built a replacement using a different coordination model.
How Good Is It?
Automated Metrics
Compared against Kling O3 Pro, Veo 3.1, OmniHuman 1.5, and Seedance 2.0:
| Metric | What It Measures | Avatar V | Best Competitor |
|---|---|---|---|
| SyncNet Confidence | Lip-audio sync quality (SyncNet measures how well lip movements match audio; higher = better) | 8.97 | 8.86 (Seedance) |
| Face Similarity | Identity preservation | 0.840 | 0.838 (Kling) |
| Q-Align | Visual quality (scored by a vision-language model calibrated to human opinion) | 4.85 | 4.95 (Veo 3.1*) |
*Veo 3.1 wins on visual quality but severely sacrifices identity (Face Sim = 0.714). Over-sharpening inflates its quality score.
Human Evaluation
Avatar V scored highest on all 6 dimensions rated by trained human annotators (5-point scale), including:
- Identity: 4.98/5 (near perfect)
- Lip Sync: 4.69/5
- Motion Naturalness: 4.48/5
- Visual Quality: 4.78/5
The Turing Test
In a "is it real?" test, human annotators correctly identified the real video 77.8% of the time. But in 61% of test cases, at least one of three annotators was fooled by the AI-generated video.
Why This Paper Matters
For Video Production Teams
Avatar V represents a shift from "generic AI video" to "personalized AI video at scale." Previous systems could generate videos of a generic person talking, but couldn't faithfully reproduce a specific person's talking rhythm, micro-expressions, and gestural tendencies. For companies creating personalized video content — training videos, marketing, customer support, localization — this means AI-generated avatars that are actually recognizable as the real person, not just visually similar.
For the Research Community
The paper introduces several techniques with broad applicability beyond avatars:
- Sparse Reference Attention solves the quadratic scaling problem for conditioning on long reference contexts — relevant for any video generation system that conditions on reference material
- The motion representation stream demonstrates that identity and motion can be disentangled and transferred separately, opening the door to motion style transfer across different identities
- The five-stage progressive training pipeline provides a practical template for training complex generative systems — starting broad (text-to-video), then specializing (lip sync, identity, speed, quality) in stages rather than trying to learn everything at once
The Bigger Picture
Avatar V points toward a future where video communication is no longer bottlenecked by the physical availability of the speaker. A CEO could record a 30-second reference video and generate personalized messages to thousands of employees in their own talking style. Educational content could be delivered by an instructor's avatar speaking any of 50+ languages while preserving their teaching mannerisms. The technology raises important questions about consent, deepfakes, and authenticity — but the production-quality bar it sets (1080p, unlimited duration, state-of-the-art fidelity) means these conversations are no longer hypothetical.