HunyuanVideo-Avatar: Animate Any Portrait with Audio

Key Takeaways

HunyuanVideo-Avatar is an open-source, audio-driven human animation model from Tencent Hunyuan + Tencent Music Entertainment Lyra Lab, published on arXiv May 26, 2025 (arXiv:2505.20156).
It is built on a Multimodal Diffusion Transformer (MM-DiT) architecture and introduces three novel modules: a Character Image Injection Module, an Audio Emotion Module (AEM), and a Face-Aware Audio Adapter (FAA).
It supports multi-character dialogue animation — multiple avatars in a single scene driven by independent audio streams, isolated via latent-level face masks.
Input: any portrait image (photorealistic, cartoon, 3D-rendered, anthropomorphic) at arbitrary resolution + audio. Output: a high-dynamic talking-head video with synced emotion.
Minimum GPU: 24GB VRAM (slow). Recommended: 80–96GB. Runs on 10GB VRAM via TeaCache (thanks to Wan2GP).
Available free: GitHub (2K stars, 327 forks) · Hugging Face (324 likes) · Live demo.
Bonus at the end: How Gaga AI pairs with HunyuanVideo-Avatar for image-to-video, audio infusion, voice cloning, and TTS.

Table of Contents

What Is HunyuanVideo-Avatar?

HunyuanVideo-Avatar is an open-source AI model that animates a single portrait image into a realistic, emotion-controlled talking avatar video — driven entirely by an audio input.

The full title is “HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters.” It was jointly developed by Tencent Hunyuan and Tencent Music Entertainment Lyra Lab, submitted to arXiv on May 26, 2025 (v2 revised June 3, 2025).

The model does three things that prior systems struggled with simultaneously:

Generates highly dynamic video (natural head movement, background motion) while keeping the character identity consistent frame to frame
Controls facial emotion based on a separate emotion reference image — not just lip sync, but actual expressive emotion transfer
Animates multiple characters in the same scene, each driven by an independent audio stream, without cross-character bleed

Authors: Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, Qinglin Lu

Where to find it:

💻 GitHub: Tencent-Hunyuan/HunyuanVideo-Avatar — 2K stars, 327 forks
🤗 Hugging Face: tencent/HunyuanVideo-Avatar — 324 likes
🎮 Live demo (Tencent Hunyuan platform)
📄 arXiv: 2505.20156
🌐 Project page with video demos

Why Audio-Driven Avatar Animation Is Still Hard

Generating a talking avatar that looks and feels real requires solving three problems at once — and most existing models only solve one or two.

Problem 1: Dynamic Motion vs. Character Consistency

Most audio-driven avatar models face a trade-off: the more dynamic the motion (natural head movement, body sway, background life), the more the character’s face drifts from the original image over time. Prior methods used simple “addition-based character conditioning” — adding the reference image feature directly to the noise signal. This creates a mismatch between how the reference is used during training vs. inference, leading to identity drift in long sequences.

Problem 2: Emotion Alignment

Lip sync is solved at a basic level by most models. What isn’t solved: making the character look like they feel what they’re saying. Sad audio should produce a sad expression; excited audio should produce an energetic face. This requires extracting and transferring emotional cues — not just phoneme alignment.

Problem 3: Multi-Character Scenes

When multiple people are in a single frame, each driven by their own audio stream, the model needs to know which audio belongs to which face — and ensure the facial motion of character A doesn’t influence character B. This “audio bleed” problem is not addressed at all in most single-character avatar models.

HunyuanVideo-Avatar was specifically designed to solve all three problems, which is why the architecture introduces three new modules rather than refining an existing one.

The Three Core Innovations in HunyuanVideo-Avatar

HunyuanVideo-Avatar’s architecture is built on MM-DiT (Multimodal Diffusion Transformer) and introduces three new modules — each solving one of the three core problems above.

1. Character Image Injection Module

What it solves: Character identity drift in dynamic video.

What it does: Replaces the conventional addition-based character conditioning with a dedicated injection module. Instead of adding the character image features to the latent noise (which creates a training-inference mismatch), the injection module conditions the diffusion process in a way that’s consistent between training and inference.

Result: The character’s face, proportions, and style remain stable across all frames — even when the background moves dynamically and the head rotates significantly.

2. Audio Emotion Module (AEM)

What it solves: Emotion alignment between audio content and facial expression.

What it does: The AEM extracts emotional cues from a separate emotion reference image provided by the user. It doesn’t interpret emotion from audio waveforms directly (audio emotion recognition is imprecise); instead, it uses a visual reference to define the target emotional style, then transfers that style into the generated video conditioned on the audio.

Result: Fine-grained, accurate emotion control. A user can specify “this character should look sad” by providing a sad reference image — and the generated video will reflect that emotion while still being driven by the audio for lip sync.

3. Face-Aware Audio Adapter (FAA)

What it solves: Multi-character audio bleed.

What it does: The FAA isolates each character using a latent-level face mask — a spatial mask applied in the diffusion model’s latent space that corresponds to each character’s face region. Each character receives their audio signal via an independent cross-attention mechanism, routed through their own face mask. This prevents audio signal for character A from influencing the facial motion of character B.

Result: True multi-character animation — two or more avatars in the same video frame, each with independent lip sync and motion, without interference.

What HunyuanVideo-Avatar Can Generate

HunyuanVideo-Avatar accepts any portrait image style and generates video at multiple scales — portrait, upper-body, or full-body.

Supported Input Styles

Photorealistic — standard photography portraits
Cartoon / Illustrated — 2D anime, flat-art, comic styles
3D Rendered — CGI characters, game assets
Anthropomorphic — non-human characters with human-like features

Supported Output Scales

Portrait — face and neck, tight frame
Upper-body — shoulders and above
Full-body — complete figure with environment

Key Generation Characteristics

High-dynamic foreground and background — not a static background with a moving face
Arbitrary input resolution and scale (no fixed crop requirement)
Emotion-controllable via reference image
Multi-character support via FAA

Use Cases

E-commerce — product presenters, virtual salespeople
Online streaming — virtual streamers and VTubers
Social media content — animated profile videos, scripted short-form
Video content creation and editing — dialogue scenes with multiple AI characters
Education and training — AI instructors from any portrait
Entertainment — animated historical figures, fictional characters

System Requirements

HunyuanVideo-Avatar requires an NVIDIA GPU with CUDA support. The minimum for any inference is 24GB VRAM; the recommended setup for quality output is 80–96GB VRAM.

Setup	GPU Memory	Speed	Notes
Minimum (standard)	24GB	Very slow	704×768, 129 frames
Recommended	80–96GB	Production speed	Full quality output
Low VRAM (FP8 + CPU offload)	< 24GB	Slow	–cpu-offload flag
Ultra-low VRAM (TeaCache)	10GB	Moderate	Via Wan2GP integration

Tip: If you experience OOM (out-of-memory) errors on an 80GB GPU, reduce input image resolution. The model was tested on an 8-GPU machine.

Software requirements:

OS: Linux (tested; Windows not officially supported)
CUDA: 12.4 (recommended) or 11.8
Python: 3.10.9
PyTorch: 2.4.0
Flash Attention v2.6.3 (for acceleration)

How to Install HunyuanVideo-Avatar: Step-by-Step

HunyuanVideo-Avatar installs via conda + pip on Linux. A Docker image is also available for the fastest path to inference.

Option A: Install from Scratch (Linux)

Step 1 — Clone the repository:

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar.git

cd HunyuanVideo-Avatar

Step 2 — Create and activate the conda environment:

conda create -n HunyuanVideo-Avatar python==3.10.9

conda activate HunyuanVideo-Avatar

Step 3 — Install PyTorch:

# For CUDA 12.4 (recommended)

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# For CUDA 11.8

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Step 4 — Install pip dependencies:

python -m pip install -r requirements.txt

Step 5 — Install Flash Attention (recommended for speed):

python -m pip install ninja

python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Option B: Docker (Fastest Setup)

# CUDA 12.4 (recommended — avoids float point exceptions)

docker pull hunyuanvideo/hunyuanvideo:cuda_12

docker run -itd –gpus all –init –net=host –uts=host –ipc=host \

–name hunyuanvideo –security-opt=seccomp=unconfined \

–ulimit=stack=67108864 –ulimit=memlock=-1 –privileged \

hunyuanvideo/hunyuanvideo:cuda_12

# Then install additional packages inside the container:

pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

For CUDA 11.8, replace cuda_12 with cuda_11 in the pull command.

Step 6 — Download Pretrained Model Weights

Follow the download instructions in the official README weights section. Place the downloaded checkpoints in the ./weights directory as specified.

The checkpoint structure expected:

weights/

└── ckpts/

└── hunyuan-video-t2v-720p/

└── transformers/

├── mp_rank_00_model_states.pt ← full precision

└── mp_rank_00_model_states_fp8.pt ← FP8 (single GPU)

How to Run Inference

Multi-GPU Inference (Recommended for Quality)

8-GPU parallel inference — best output quality:

cd HunyuanVideo-Avatar

export PYTHONPATH=./

export MODEL_BASE=”./weights”

checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt

torchrun –nnodes=1 –nproc_per_node=8 –master_port 29605 hymm_sp/sample_batch.py \

–input ‘assets/test.csv’ \

–ckpt ${checkpoint_path} \

–sample-n-frames 129 \

–seed 128 \

–image-size 704 \

–cfg-scale 7.5 \

–infer-steps 50 \

–use-deepcache 1 \

–flow-shift-eval-video 5.0 \

–save-path ./results

Single-GPU Inference (FP8 Mode)

export PYTHONPATH=./

export MODEL_BASE=./weights

export DISABLE_SP=1

checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt

CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \

–input ‘assets/test.csv’ \

–ckpt ${checkpoint_path} \

–sample-n-frames 129 \

–seed 128 \

–image-size 704 \

–cfg-scale 7.5 \

–infer-steps 50 \

–use-deepcache 1 \

–flow-shift-eval-video 5.0 \

–save-path ./results-single \

–use-fp8 \

–infer-min

Very Low VRAM Mode (CPU Offload)

export CPU_OFFLOAD=1

CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \

–input ‘assets/test.csv’ \

–ckpt ${checkpoint_path} \

–sample-n-frames 129 \

–seed 128 \

–image-size 704 \

–cfg-scale 7.5 \

–infer-steps 50 \

–use-deepcache 1 \

–flow-shift-eval-video 5.0 \

–save-path ./results-poor \

–use-fp8 \

–cpu-offload \

–infer-min

10GB VRAM Mode (TeaCache via Wan2GP)

HunyuanVideo-Avatar now supports 10GB VRAM single-GPU inference thanks to the Wan2GP integration with TeaCache. Check the Wan2GP repository for the specific launch instructions — no quality degradation reported.

Gradio Web UI

cd HunyuanVideo-Avatar

bash ./scripts/run_gradio.sh

Opens a local browser UI. No .csv file preparation required — you interact directly with the interface.

Input Format

Inference input uses a .csv file (assets/test.csv) specifying:

Portrait image path
Audio file path
(Optional) Emotion reference image path
Output resolution and frame count settings

Refer to the assets/ folder in the repo for working examples before modifying for your own content.

Common Problems and Fixes

Float Point Exception (Core Dump) on Startup

Cause: cuBLAS / cuDNN version mismatch on certain GPU types.

Fix Option 1 — Update cuBLAS:

pip install nvidia-cublas-cu12==12.4.5.8

export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

Fix Option 2 — Force CUDA 11.8 packages:

pip uninstall -r requirements.txt

pip install torch==2.4.0 –index-url https://download.pytorch.org/whl/cu118

pip install -r requirements.txt

pip install ninja

pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Out of Memory (OOM) on 80GB GPU

Cause: Input image resolution is too high for the available VRAM budget.

Fix: Reduce image resolution before passing as input. The model has no fixed minimum resolution, so scaling down the portrait image directly reduces peak VRAM usage.

Character Identity Drifts Mid-Video

Cause: This was the core problem in prior models, which the Character Image Injection Module addresses. If drift still occurs, it likely indicates the reference image has conflicting or ambiguous identity signals (e.g., heavy post-processing, non-frontal poses).

Fix: Use a clean, front-facing, well-lit portrait as the character reference. Avoid heavily filtered or stylized images unless the output style is intentionally artistic.

Multi-Character Audio Bleed

Cause: Incorrect face mask assignment in the input CSV, or face regions overlapping significantly.

Fix: Ensure each character’s face region is correctly defined in the input. The FAA uses latent-level face masks — verify the spatial coordinates match each character’s face position in the frame.

HunyuanVideo-Avatar vs. Other Talking Avatar Models

HunyuanVideo-Avatar is the only open-source model that simultaneously addresses high-dynamic generation, emotion controllability, and multi-character animation in a single framework.

Feature	HunyuanVideo-Avatar	SadTalker	EMO	AniPortrait
Architecture	MM-DiT	3DMM + diffusion	Diffusion	Diffusion
Multi-character support	✅ Yes (FAA)	❌ No	❌ No	❌ No
Emotion reference control	✅ Yes (AEM)	Limited	Limited	❌ No
Character identity stability	✅ Strong (injection module)	Moderate	Good	Moderate
Input style variety	✅ Photo/cartoon/3D/anthro	Photo mainly	Photo mainly	Photo mainly
Min VRAM	10GB (TeaCache)	~6GB	~16GB	~8GB
Open-source	✅ Yes	✅ Yes	❌ No	✅ Yes
Live demo available	✅ Yes	Limited	❌	❌

The gap is largest on multi-character support — no other open-source model in this space handles it at the architecture level.

Bonus: Go Further with Gaga AI

HunyuanVideo-Avatar animates your portrait with audio. Gaga AI takes that output and builds a full video production around it — adding generated backgrounds, original audio, a cloned voice, and TTS narration.

The four Gaga AI modules that pair most directly with a HunyuanVideo-Avatar workflow:

Generate Video Free

Learn Gaga AI

Image-to-Video AI

Gaga AI’s image-to-video engine generates a cinematic motion sequence from a single still image — including AI-generated backgrounds to composite behind your HunyuanVideo-Avatar character.

Take a background scene image — a studio, an outdoor environment, a branded space — and prompt Gaga AI to animate it: “slow zoom into a softly lit studio interior” or “city skyline at sunset with subtle wind motion.” The result is a moving background that wraps your avatar in a fully dynamic scene.

Best for: Music video backdrops, branded content, social media video series, virtual studio setups.

Video and Audio Infusion

Gaga AI analyzes the combined video output from HunyuanVideo-Avatar and generates or binds synchronized audio — ambient sound, music, or environmental effects matched to the visual scene.

After the avatar animation is rendered, Gaga AI’s audio infusion layer adds the surrounding audio context: the hum of a crowd, background music, or environmental sound. This transforms an audio-driven avatar clip into a complete audiovisual experience.

Best for: Final production assembly, social video publishing, short-form content at scale.

AI Avatar (Gaga AI’s own avatar system)

Gaga AI offers its own avatar generation system — complementary to HunyuanVideo-Avatar — for cases where you need a synthetic presenter generated from scratch rather than animated from a photo.

Where HunyuanVideo-Avatar starts from a real portrait image and animates it, Gaga AI’s avatar system creates the presenter from text or minimal reference, then drives it with audio. The two tools cover different ends of the same workflow spectrum.

Best for: Cases where no source portrait exists, AI influencer creation, scalable presenter content.

AI Voice Clone + TTS

Gaga AI clones a target speaker’s voice from a short audio sample and generates narration, dialogue, or commentary in that voice from any text input.

Workflow with HunyuanVideo-Avatar:

Record 30–60 seconds of your presenter’s voice
Clone the voice in Gaga AI
Write the script as text
Generate narration in the cloned voice
Feed that audio into HunyuanVideo-Avatar as the driving audio input
The avatar’s lips and emotion sync to the cloned voice output

This is the pipeline for creating fully synthetic presenter videos where the speaker never re-records after the initial voice sample — including for multilingual versions.

Best for: Multilingual content production, AI spokesperson series, scalable narration at volume.

The Complete Pipeline

Portrait Image + Script Text

↓

Gaga AI Voice Clone + TTS ── Generate natural narration in cloned voice

↓

HunyuanVideo-Avatar ── Animate portrait with AEM emotion + FAA multi-char

↓

Gaga AI Image-to-Video ── Animate the background scene

↓

Gaga AI Audio Infusion ── Add ambient audio, music, environmental sound

↓

Final Video ── Studio-quality AI presenter content, no studio required

Frequently Asked Questions

What is HunyuanVideo-Avatar?

HunyuanVideo-Avatar is an open-source audio-driven human animation model from Tencent Hunyuan and Tencent Music Entertainment Lyra Lab. It takes a portrait image and an audio clip as input and generates a realistic talking avatar video with emotion control and multi-character support. The paper was submitted to arXiv on May 26, 2025 (arXiv:2505.20156).

What makes HunyuanVideo-Avatar different from other talking avatar models?

HunyuanVideo-Avatar introduces three architectural innovations not found together in any other open-source model: (1) a Character Image Injection Module for stable character identity in dynamic video, (2) an Audio Emotion Module (AEM) for emotion transfer via a reference image, and (3) a Face-Aware Audio Adapter (FAA) that enables true multi-character animation with independent audio streams.

Does HunyuanVideo-Avatar support cartoon or non-photorealistic characters?

Yes. HunyuanVideo-Avatar supports photorealistic, cartoon/illustrated, 3D-rendered, and anthropomorphic character styles at arbitrary input resolutions. It is not limited to realistic photographs.

What GPU do I need to run HunyuanVideo-Avatar?

The minimum is 24GB VRAM, though this is described as “very slow.” The recommended setup is 80–96GB VRAM for quality output. A 10GB VRAM option is available via the Wan2GP TeaCache integration, added June 6, 2025.

Is HunyuanVideo-Avatar free?

Yes. The inference code and model weights are open-source, released May 28, 2025. A live cloud demo is also available on the Tencent Hunyuan platform at no cost for testing.

Does HunyuanVideo-Avatar work on Windows?

The officially tested operating system is Linux. Windows is not officially supported. Users attempting Windows installation should expect unsupported configuration issues.

What is the Audio Emotion Module (AEM)?

The AEM is a component of HunyuanVideo-Avatar that extracts emotional style from a user-provided emotion reference image and transfers it to the generated video. This allows precise control over the character’s facial expressions — independent of the emotional content of the driving audio — enabling fine-grained emotion direction.

What is the Face-Aware Audio Adapter (FAA)?

The FAA is the module that enables multi-character animation. It uses a latent-level face mask to isolate each character’s face region in the diffusion model’s latent space, then routes each character’s audio signal through a dedicated cross-attention mechanism. This prevents audio intended for one character from influencing another character’s facial motion.

Can HunyuanVideo-Avatar generate full-body video?

Yes. The model supports portrait (face/neck), upper-body, and full-body generation scales.

How do I cite HunyuanVideo-Avatar in research?

@misc{chen2025hunyuanvideoavatarhighfidelityaudiodrivenhuman,

title={HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters},

author={Yi Chen and Sen Liang and Zixiang Zhou and Ziyao Huang and Yifeng Ma

and Junshu Tang and Qin Lin and Yuan Zhou and Qinglin Lu},

year={2025},

eprint={2505.20156},

archivePrefix={arXiv},

primaryClass={cs.CV},

url={https://arxiv.org/abs/2505.20156},

}

Is ComfyUI support available for HunyuanVideo-Avatar?

ComfyUI support is on the official open-source roadmap but has not been released yet. Check the GitHub repository for current status.

Official Resources

Resource	Link
Project Page (video demos)	hunyuanvideo-avatar.github.io
GitHub Repository	Tencent-Hunyuan/HunyuanVideo-Avatar
Hugging Face Model	tencent/HunyuanVideo-Avatar
Live Cloud Demo	Tencent Hunyuan Platform
arXiv Paper	arxiv.org/abs/2505.20156
Wan2GP (10GB VRAM support)	Wan2GP on Hugging Face

HunyuanVideo-Avatar: Animate Any Portrait with Audio