{"id":1895,"date":"2026-03-12T14:41:17","date_gmt":"2026-03-12T06:41:17","guid":{"rendered":"https:\/\/gaga.art\/blog\/?p=1895"},"modified":"2026-03-12T14:41:19","modified_gmt":"2026-03-12T06:41:19","slug":"hunyuanvideo-avatar","status":"publish","type":"post","link":"https:\/\/gaga.art\/blog\/hunyuanvideo-avatar\/","title":{"rendered":"HunyuanVideo-Avatar: Animate Any Portrait with Audio"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"631\" src=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-1024x631.webp\" alt=\"hunyuanvideo avatar\" class=\"wp-image-1897\" srcset=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-1024x631.webp 1024w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-300x185.webp 300w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-768x473.webp 768w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-1536x947.webp 1536w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-2048x1262.webp 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"key-takeaways\" style=\"font-size:24px\"><strong>Key Takeaways<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>HunyuanVideo-Avatar is an open-source, audio-driven human animation model from Tencent Hunyuan + Tencent Music Entertainment Lyra Lab, published on arXiv May 26, 2025 (arXiv:2505.20156).<\/li>\n\n\n\n<li>It is built on a Multimodal Diffusion Transformer (MM-DiT) architecture and introduces three novel modules: a Character Image Injection Module, an Audio Emotion Module (AEM), and a Face-Aware Audio Adapter (FAA).<\/li>\n\n\n\n<li>It supports multi-character dialogue animation \u2014 multiple avatars in a single scene driven by independent audio 
streams, isolated via latent-level face masks.<\/li>\n\n\n\n<li>Input: any portrait image (photorealistic, cartoon, 3D-rendered, anthropomorphic) at arbitrary resolution + audio. Output: a high-dynamic talking-head video with synced emotion.<\/li>\n\n\n\n<li>Minimum GPU: 24GB VRAM (slow). Recommended: 80\u201396GB. Runs on 10GB VRAM via TeaCache (thanks to Wan2GP).<\/li>\n\n\n\n<li>Available free: <a href=\"https:\/\/github.com\/Tencent-Hunyuan\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">GitHub<\/a> (2K stars, 327 forks) \u00b7 <a href=\"https:\/\/huggingface.co\/tencent\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">Hugging Face<\/a> (324 likes) \u00b7 <a href=\"https:\/\/hunyuan.tencent.com\/modelSquare\/home\/play?modelId=126\" rel=\"nofollow noopener\" target=\"_blank\">Live demo<\/a>.<\/li>\n\n\n\n<li>Bonus at the end: How Gaga AI pairs with HunyuanVideo-Avatar for image-to-video, audio infusion, voice cloning, and TTS.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-rank-math-toc-block has-custom-cd-994-c-color has-text-color has-link-color wp-elements-ce8aee5894ee31fa7efa57372e4bc0c8\" id=\"rank-math-toc\"><p>Table of Contents<\/p><nav><ul><li><a href=\"#what-is-hunyuan-video-avatar\">What Is HunyuanVideo-Avatar?<\/a><\/li><li><a href=\"#why-audio-driven-avatar-animation-is-still-hard\">Why Audio-Driven Avatar Animation Is Still Hard<\/a><\/li><li><a href=\"#the-three-core-innovations-in-hunyuan-video-avatar\">The Three Core Innovations in HunyuanVideo-Avatar<\/a><\/li><li><a href=\"#what-hunyuan-video-avatar-can-generate\">What HunyuanVideo-Avatar Can Generate<\/a><\/li><li><a href=\"#system-requirements\">System Requirements<\/a><\/li><li><a href=\"#how-to-install-hunyuan-video-avatar-step-by-step\">How to Install HunyuanVideo-Avatar: Step-by-Step<\/a><\/li><li><a href=\"#how-to-run-inference\">How to Run Inference<\/a><\/li><li><a href=\"#common-problems-and-fixes\">Common Problems and 
Fixes<\/a><\/li><li><a href=\"#hunyuan-video-avatar-vs-other-talking-avatar-models\">HunyuanVideo-Avatar vs. Other Talking Avatar Models<\/a><\/li><li><a href=\"#bonus-go-further-with-gaga-ai\">Bonus: Go Further with Gaga AI<\/a><\/li><li><a href=\"#frequently-asked-questions\">Frequently Asked Questions<\/a><\/li><li><a href=\"#official-resources\">Official Resources<\/a><\/li><\/ul><\/nav><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-hunyuan-video-avatar\"><strong>What Is HunyuanVideo-Avatar?<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar is an open-source AI model that animates a single portrait image into a realistic, emotion-controlled talking avatar video \u2014 driven entirely by an audio input.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/nehI9TuSb3A?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>The full title is <em>&#8220;HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters.&#8221;<\/em> It was jointly developed by Tencent Hunyuan and Tencent Music Entertainment Lyra Lab, submitted to arXiv on May 26, 2025 (v2 revised June 3, 2025).<\/p>\n\n\n\n<p>The model does three things that prior systems struggled with simultaneously:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Generates highly dynamic video (natural head movement, background motion) while keeping the character identity consistent frame to 
frame<\/li>\n\n\n\n<li>Controls facial emotion based on a separate emotion reference image \u2014 not just lip sync, but actual expressive emotion transfer<\/li>\n\n\n\n<li>Animates multiple characters in the same scene, each driven by an independent audio stream, without cross-character bleed<\/li>\n<\/ol>\n\n\n\n<p>Authors: Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, Qinglin Lu<\/p>\n\n\n\n<p>Where to find it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\ud83d\udcbb <a href=\"https:\/\/github.com\/Tencent-Hunyuan\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">GitHub: Tencent-Hunyuan\/HunyuanVideo-Avatar<\/a> \u2014 2K stars, 327 forks<\/li>\n\n\n\n<li>\ud83e\udd17 <a href=\"https:\/\/huggingface.co\/tencent\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">Hugging Face: tencent\/HunyuanVideo-Avatar<\/a> \u2014 324 likes<\/li>\n\n\n\n<li>\ud83c\udfae <a href=\"https:\/\/hunyuan.tencent.com\/modelSquare\/home\/play?modelId=126\" rel=\"nofollow noopener\" target=\"_blank\">Live demo (Tencent Hunyuan platform)<\/a><\/li>\n\n\n\n<li>\ud83d\udcc4 <a href=\"https:\/\/arxiv.org\/abs\/2505.20156\" rel=\"nofollow noopener\" target=\"_blank\">arXiv: 2505.20156<\/a><\/li>\n\n\n\n<li>\ud83c\udf10 <a href=\"https:\/\/hunyuanvideo-avatar.github.io\/\" rel=\"nofollow noopener\" target=\"_blank\">Project page with video demos<\/a><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-audio-driven-avatar-animation-is-still-hard\"><strong>Why Audio-Driven Avatar Animation Is Still Hard<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>Generating a talking avatar that looks and feels real requires solving three problems at once \u2014 and most existing models only solve one or two.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"problem-1-dynamic-motion-vs-character-consistency\" style=\"font-size:24px\"><strong>Problem 1: Dynamic Motion vs. 
Character Consistency<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Most audio-driven avatar models face a trade-off: the more dynamic the motion (natural head movement, body sway, background life), the more the character&#8217;s face drifts from the original image over time. Prior methods used simple &#8220;addition-based character conditioning&#8221; \u2014 adding the reference image feature directly to the noise signal. This creates a mismatch between how the reference is used during training vs. inference, leading to identity drift in long sequences.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"problem-2-emotion-alignment\" style=\"font-size:24px\"><strong>Problem 2: Emotion Alignment<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Lip sync is solved at a basic level by most models. What isn&#8217;t solved: making the character <em>look like they feel<\/em> what they&#8217;re saying. Sad audio should produce a sad expression; excited audio should produce an energetic face. This requires extracting and transferring emotional cues \u2014 not just phoneme alignment.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"problem-3-multi-character-scenes\" style=\"font-size:24px\"><strong>Problem 3: Multi-Character Scenes<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>When multiple people are in a single frame, each driven by their own audio stream, the model needs to know which audio belongs to which face \u2014 and ensure the facial motion of character A doesn&#8217;t influence character B. 
This &#8220;audio bleed&#8221; problem is not addressed at all in most single-character avatar models.<\/p>\n\n\n\n<p>HunyuanVideo-Avatar was specifically designed to solve all three problems, which is why the architecture introduces three new modules rather than refining an existing one.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-three-core-innovations-in-hunyuan-video-avatar\"><strong>The Three Core Innovations in HunyuanVideo-Avatar<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar&#8217;s architecture is built on MM-DiT (Multimodal Diffusion Transformer) and introduces three new modules \u2014 each solving one of the three core problems above.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"782\" src=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-features-1024x782.webp\" alt=\"hunyuanvideo avatar features\" class=\"wp-image-1896\" srcset=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-features-1024x782.webp 1024w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-features-300x229.webp 300w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-features-768x587.webp 768w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-features-1536x1173.webp 1536w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/hunyuanvideo-avatar-features-2048x1564.webp 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1-character-image-injection-module\" style=\"font-size:24px\"><strong>1. Character Image Injection Module<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>What it solves: Character identity drift in dynamic video.<\/p>\n\n\n\n<p>What it does: Replaces the conventional addition-based character conditioning with a dedicated injection module. 
Instead of adding the character image features to the latent noise (which creates a training-inference mismatch), the injection module conditions the diffusion process in a way that&#8217;s consistent between training and inference.<\/p>\n\n\n\n<p>Result: The character&#8217;s face, proportions, and style remain stable across all frames \u2014 even when the background moves dynamically and the head rotates significantly.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"2-audio-emotion-module-aem\" style=\"font-size:24px\"><strong>2. Audio Emotion Module (AEM)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>What it solves: Emotion alignment between audio content and facial expression.<\/p>\n\n\n\n<p>What it does: The AEM extracts emotional cues from a separate emotion reference image provided by the user. It doesn&#8217;t interpret emotion from audio waveforms directly (audio emotion recognition is imprecise); instead, it uses a visual reference to define the target emotional style, then transfers that style into the generated video conditioned on the audio.<\/p>\n\n\n\n<p>Result: Fine-grained, accurate emotion control. A user can specify &#8220;this character should look sad&#8221; by providing a sad reference image \u2014 and the generated video will reflect that emotion while still being driven by the audio for lip sync.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"3-face-aware-audio-adapter-faa\" style=\"font-size:24px\"><strong>3. Face-Aware Audio Adapter (FAA)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>What it solves: Multi-character audio bleed.<\/p>\n\n\n\n<p>What it does: The FAA isolates each character using a latent-level face mask \u2014 a spatial mask applied in the diffusion model&#8217;s latent space that corresponds to each character&#8217;s face region. Each character receives their audio signal via an independent cross-attention mechanism, routed through their own face mask. 
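The isolation idea behind the FAA can be pictured with a toy sketch. This is pure illustration under simplified assumptions (a 2D latent grid of plain Python lists, scalar audio signals), not the repo's actual cross-attention code:

```python
# Toy illustration of latent-level face masking (the FAA principle),
# not the HunyuanVideo-Avatar implementation: each character's
# audio-driven update is applied only inside that character's mask.

def route_audio(latent, face_masks, audio_signals):
    '''Apply each character's audio update only within its face-mask region.'''
    h, w = len(latent), len(latent[0])
    out = [row[:] for row in latent]  # copy, leave the input untouched
    for mask, signal in zip(face_masks, audio_signals):
        for i in range(h):
            for j in range(w):
                if mask[i][j]:           # inside this character's face region
                    out[i][j] += signal  # audio influences only this region
    return out

# Two characters on a 1x4 latent strip: character A owns cells 0-1,
# character B owns cells 2-3. Independent signals never bleed across.
latent = [[0.0, 0.0, 0.0, 0.0]]
masks = [[[1, 1, 0, 0]], [[0, 0, 1, 1]]]
out = route_audio(latent, masks, [0.5, -0.3])
print(out)  # [[0.5, 0.5, -0.3, -0.3]]
```

Because the masks are disjoint, character A's signal (0.5) never touches character B's cells, which is exactly the bleed-prevention property described here.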
This prevents audio signal for character A from influencing the facial motion of character B.<\/p>\n\n\n\n<p>Result: True multi-character animation \u2014 two or more avatars in the same video frame, each with independent lip sync and motion, without interference.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-hunyuan-video-avatar-can-generate\"><strong>What HunyuanVideo-Avatar Can Generate<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar accepts any portrait image style and generates video at multiple scales \u2014 portrait, upper-body, or full-body.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"supported-input-styles\" style=\"font-size:24px\"><strong>Supported Input Styles<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Photorealistic \u2014 standard photography portraits<\/li>\n\n\n\n<li>Cartoon \/ Illustrated \u2014 2D anime, flat-art, comic styles<\/li>\n\n\n\n<li>3D Rendered \u2014 CGI characters, game assets<\/li>\n\n\n\n<li>Anthropomorphic \u2014 non-human characters with human-like features<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"supported-output-scales\" style=\"font-size:24px\"><strong>Supported Output Scales<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Portrait \u2014 face and neck, tight frame<\/li>\n\n\n\n<li>Upper-body \u2014 shoulders and above<\/li>\n\n\n\n<li>Full-body \u2014 complete figure with environment<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"key-generation-characteristics\" style=\"font-size:24px\"><strong>Key Generation Characteristics<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>High-dynamic foreground and background \u2014 not a static background with a moving face<\/li>\n\n\n\n<li>Arbitrary input resolution and scale (no fixed crop requirement)<\/li>\n\n\n\n<li>Emotion-controllable via reference image<\/li>\n\n\n\n<li>Multi-character support via 
FAA<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"use-cases\" style=\"font-size:24px\"><strong>Use Cases<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>E-commerce \u2014 product presenters, virtual salespeople<\/li>\n\n\n\n<li>Online streaming \u2014 virtual streamers and VTubers<\/li>\n\n\n\n<li>Social media content \u2014 animated profile videos, scripted short-form<\/li>\n\n\n\n<li>Video content creation and editing \u2014 dialogue scenes with multiple AI characters<\/li>\n\n\n\n<li>Education and training \u2014 AI instructors from any portrait<\/li>\n\n\n\n<li>Entertainment \u2014 animated historical figures, fictional characters<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"system-requirements\"><strong>System Requirements<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar requires an NVIDIA GPU with CUDA support. The minimum for any inference is 24GB VRAM; the recommended setup for quality output is 80\u201396GB VRAM.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Setup<\/strong><\/td><td><strong>GPU Memory<\/strong><\/td><td><strong>Speed<\/strong><\/td><td><strong>Notes<\/strong><\/td><\/tr><tr><td>Minimum (standard)<\/td><td>24GB<\/td><td>Very slow<\/td><td>704\u00d7768, 129 frames<\/td><\/tr><tr><td>Recommended<\/td><td>80\u201396GB<\/td><td>Production speed<\/td><td>Full quality output<\/td><\/tr><tr><td>Low VRAM (FP8 + CPU offload)<\/td><td>&lt; 24GB<\/td><td>Slow<\/td><td>--cpu-offload flag<\/td><\/tr><tr><td>Ultra-low VRAM (TeaCache)<\/td><td>10GB<\/td><td>Moderate<\/td><td>Via Wan2GP integration<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Tip: If you experience OOM (out-of-memory) errors on an 80GB GPU, reduce input image resolution. 
The model was tested on an 8-GPU machine.<\/p>\n\n\n\n<p>Software requirements:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OS: Linux (tested; Windows not officially supported)<\/li>\n\n\n\n<li>CUDA: 12.4 (recommended) or 11.8<\/li>\n\n\n\n<li>Python: 3.10.9<\/li>\n\n\n\n<li>PyTorch: 2.4.0<\/li>\n\n\n\n<li>Flash Attention v2.6.3 (for acceleration)<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-to-install-hunyuan-video-avatar-step-by-step\"><strong>How to Install HunyuanVideo-Avatar: Step-by-Step<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar installs via conda + pip on Linux. A Docker image is also available for the fastest path to inference.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"option-a-install-from-scratch-linux\" style=\"font-size:24px\"><strong>Option A: Install from Scratch (Linux)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-vivid-green-cyan-color has-text-color has-link-color wp-elements-bd4b2bb26709fcc03612c8ada31a59a9\">Step 1 \u2014 Clone the repository:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/Tencent-Hunyuan\/HunyuanVideo-Avatar.git\ncd HunyuanVideo-Avatar<\/code><\/pre>\n\n\n\n<p class=\"has-vivid-green-cyan-color has-text-color has-link-color wp-elements-19aa00e319acf798516ccc81827a974c\">Step 2 \u2014 Create and activate the conda environment:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>conda create -n HunyuanVideo-Avatar python==3.10.9\nconda activate HunyuanVideo-Avatar<\/code><\/pre>\n\n\n\n<p class=\"has-vivid-green-cyan-color has-text-color has-link-color wp-elements-c1743e5080b3f8b25ba782a1e7137340\">Step 3 \u2014 Install PyTorch:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># For CUDA 12.4 (recommended)\nconda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia\n\n# For CUDA 11.8\nconda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia<\/code><\/pre>\n\n\n\n<p class=\"has-vivid-green-cyan-color has-text-color has-link-color wp-elements-4eb1a7342ec5a59ef77ec01a6c7370f7\">Step 4 \u2014 Install pip dependencies:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python -m pip install -r requirements.txt<\/code><\/pre>\n\n\n\n<p class=\"has-vivid-green-cyan-color has-text-color has-link-color wp-elements-537dc0385bcb42f2aed466a8e71f01b0\">Step 5 \u2014 Install Flash Attention (recommended for speed):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python -m pip install ninja\npython -m pip install git+https:\/\/github.com\/Dao-AILab\/flash-attention.git@v2.6.3<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"option-b-docker-fastest-setup\" style=\"font-size:24px\"><strong>Option B: Docker (Fastest Setup)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># CUDA 12.4 (recommended \u2014 avoids floating point exceptions)\ndocker pull hunyuanvideo\/hunyuanvideo:cuda_12\n\ndocker run -itd --gpus all --init --net=host --uts=host --ipc=host \\\n  --name hunyuanvideo --security-opt=seccomp=unconfined \\\n  --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged \\\n  hunyuanvideo\/hunyuanvideo:cuda_12\n\n# Then install additional packages inside the container:\npip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2<\/code><\/pre>\n\n\n\n<p>For CUDA 11.8, replace cuda_12 with cuda_11 in the pull command.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"step-6-download-pretrained-model-weights\" style=\"font-size:24px\"><strong>Step 6 \u2014 Download Pretrained Model Weights<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Follow the download instructions in the <a href=\"https:\/\/github.com\/Tencent-Hunyuan\/HunyuanVideo-Avatar\/tree\/main\/weights\" rel=\"nofollow noopener\" target=\"_blank\">official README weights section<\/a>. 
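The inference scripts reference the weights by an explicit checkpoint path. A small stdlib helper, hypothetical but mirroring the weights tree documented in the README, can build that path and avoid typos:

```python
# Hypothetical convenience function (not in the repo): builds the
# checkpoint path the sample scripts expect, following the weights
# layout documented in the official README.
from pathlib import Path

def checkpoint_path(model_base='./weights', fp8=False):
    '''Return the full-precision or FP8 checkpoint path under model_base.'''
    name = 'mp_rank_00_model_states_fp8.pt' if fp8 else 'mp_rank_00_model_states.pt'
    return Path(model_base, 'ckpts', 'hunyuan-video-t2v-720p', 'transformers', name)

print(checkpoint_path())          # full-precision checkpoint
print(checkpoint_path(fp8=True))  # FP8 checkpoint for single-GPU mode
```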
Place the downloaded checkpoints in the .\/weights directory as specified.<\/p>\n\n\n\n<p>The checkpoint structure expected:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>weights\/\n\u2514\u2500\u2500 ckpts\/\n    \u2514\u2500\u2500 hunyuan-video-t2v-720p\/\n        \u2514\u2500\u2500 transformers\/\n            \u251c\u2500\u2500 mp_rank_00_model_states.pt      \u2190 full precision\n            \u2514\u2500\u2500 mp_rank_00_model_states_fp8.pt  \u2190 FP8 (single GPU)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-to-run-inference\"><strong>How to Run Inference<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"multi-gpu-inference-recommended-for-quality\" style=\"font-size:24px\"><strong>Multi-GPU Inference (Recommended for Quality)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>8-GPU parallel inference \u2014 best output quality:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cd HunyuanVideo-Avatar\nexport PYTHONPATH=.\/\nexport MODEL_BASE=\".\/weights\"\ncheckpoint_path=${MODEL_BASE}\/ckpts\/hunyuan-video-t2v-720p\/transformers\/mp_rank_00_model_states.pt\n\ntorchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp\/sample_batch.py \\\n    --input 'assets\/test.csv' \\\n    --ckpt ${checkpoint_path} \\\n    --sample-n-frames 129 \\\n    --seed 128 \\\n    --image-size 704 \\\n    --cfg-scale 7.5 \\\n    --infer-steps 50 \\\n    --use-deepcache 1 \\\n    --flow-shift-eval-video 5.0 \\\n    --save-path .\/results<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"single-gpu-inference-fp-8-mode\" style=\"font-size:24px\"><strong>Single-GPU Inference (FP8 Mode)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>export PYTHONPATH=.\/\nexport MODEL_BASE=.\/weights\nexport DISABLE_SP=1\ncheckpoint_path=${MODEL_BASE}\/ckpts\/hunyuan-video-t2v-720p\/transformers\/mp_rank_00_model_states_fp8.pt\n\nCUDA_VISIBLE_DEVICES=0 python3 hymm_sp\/sample_gpu_poor.py \\\n    --input 'assets\/test.csv' \\\n    --ckpt ${checkpoint_path} \\\n    --sample-n-frames 129 \\\n    --seed 128 \\\n    --image-size 704 \\\n    --cfg-scale 7.5 \\\n    --infer-steps 50 \\\n    --use-deepcache 1 \\\n    --flow-shift-eval-video 5.0 \\\n    --save-path .\/results-single \\\n    --use-fp8 \\\n    --infer-min<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"very-low-vram-mode-cpu-offload\" style=\"font-size:24px\"><strong>Very Low VRAM Mode (CPU Offload)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>export CPU_OFFLOAD=1\nCUDA_VISIBLE_DEVICES=0 python3 hymm_sp\/sample_gpu_poor.py \\\n    --input 'assets\/test.csv' \\\n    --ckpt ${checkpoint_path} \\\n    --sample-n-frames 129 \\\n    --seed 128 \\\n    --image-size 704 \\\n    --cfg-scale 7.5 \\\n    --infer-steps 50 \\\n    --use-deepcache 1 \\\n    --flow-shift-eval-video 5.0 \\\n    --save-path .\/results-poor \\\n    --use-fp8 \\\n    --cpu-offload \\\n    --infer-min<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"10-gb-vram-mode-tea-cache-via-wan-2-gp\" style=\"font-size:24px\"><strong>10GB VRAM Mode (TeaCache via Wan2GP)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar now supports 10GB VRAM single-GPU inference thanks to the <a href=\"https:\/\/github.com\/Tencent-Hunyuan\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">Wan2GP<\/a> integration with TeaCache. Check the Wan2GP repository for the specific launch instructions \u2014 no quality degradation reported.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"gradio-web-ui\" style=\"font-size:24px\"><strong>Gradio Web UI<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cd HunyuanVideo-Avatar\nbash .\/scripts\/run_gradio.sh<\/code><\/pre>\n\n\n\n<p>Opens a local browser UI. 
No .csv file preparation required \u2014 you interact directly with the interface.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"input-format\" style=\"font-size:24px\"><strong>Input Format<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Inference input uses a .csv file (assets\/test.csv) specifying:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Portrait image path<\/li>\n\n\n\n<li>Audio file path<\/li>\n\n\n\n<li>(Optional) Emotion reference image path<\/li>\n\n\n\n<li>Output resolution and frame count settings<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Refer to the assets\/ folder in the repo for working examples before modifying for your own content.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"common-problems-and-fixes\"><strong>Common Problems and Fixes<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"float-point-exception-core-dump-on-startup\" style=\"font-size:24px\"><strong>Floating Point Exception (Core Dump) on Startup<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Cause: cuBLAS \/ cuDNN version mismatch on certain GPU types.<\/p>\n\n\n\n<p>Fix Option 1 \u2014 Update cuBLAS:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install nvidia-cublas-cu12==12.4.5.8\nexport LD_LIBRARY_PATH=\/opt\/conda\/lib\/python3.8\/site-packages\/nvidia\/cublas\/lib\/<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>Fix Option 2 \u2014 Force CUDA 11.8 packages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip uninstall -r requirements.txt\npip install torch==2.4.0 --index-url https:\/\/download.pytorch.org\/whl\/cu118\npip install -r requirements.txt\npip install ninja\npip install git+https:\/\/github.com\/Dao-AILab\/flash-attention.git@v2.6.3<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"out-of-memory-oom-on-80-gb-gpu\" style=\"font-size:24px\"><strong>Out of Memory (OOM) on 80GB GPU<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Cause: Input image resolution is too high for the available VRAM budget.<\/p>\n\n\n\n<p>Fix: Reduce image resolution before passing as 
input. The model has no fixed minimum resolution, so scaling down the portrait image directly reduces peak VRAM usage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"character-identity-drifts-mid-video\" style=\"font-size:24px\"><strong>Character Identity Drifts Mid-Video<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Cause: This was the core problem in prior models, which the Character Image Injection Module addresses. If drift still occurs, it likely indicates the reference image has conflicting or ambiguous identity signals (e.g., heavy post-processing, non-frontal poses).<\/p>\n\n\n\n<p>Fix: Use a clean, front-facing, well-lit portrait as the character reference. Avoid heavily filtered or stylized images unless the output style is intentionally artistic.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"multi-character-audio-bleed\" style=\"font-size:24px\"><strong>Multi-Character Audio Bleed<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Cause: Incorrect face mask assignment in the input CSV, or face regions overlapping significantly.<\/p>\n\n\n\n<p>Fix: Ensure each character&#8217;s face region is correctly defined in the input. The FAA uses latent-level face masks \u2014 verify the spatial coordinates match each character&#8217;s face position in the frame.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hunyuan-video-avatar-vs-other-talking-avatar-models\"><strong>HunyuanVideo-Avatar vs. 
Other Talking Avatar Models<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar is the only open-source model that simultaneously addresses high-dynamic generation, emotion controllability, and multi-character animation in a single framework.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Feature<\/strong><\/td><td><strong>HunyuanVideo-Avatar<\/strong><\/td><td><strong>SadTalker<\/strong><\/td><td><strong>EMO<\/strong><\/td><td><strong>AniPortrait<\/strong><\/td><\/tr><tr><td>Architecture<\/td><td>MM-DiT<\/td><td>3DMM + diffusion<\/td><td>Diffusion<\/td><td>Diffusion<\/td><\/tr><tr><td>Multi-character support<\/td><td>\u2705 Yes (FAA)<\/td><td>\u274c No<\/td><td>\u274c No<\/td><td>\u274c No<\/td><\/tr><tr><td>Emotion reference control<\/td><td>\u2705 Yes (AEM)<\/td><td>Limited<\/td><td>Limited<\/td><td>\u274c No<\/td><\/tr><tr><td>Character identity stability<\/td><td>\u2705 Strong (injection module)<\/td><td>Moderate<\/td><td>Good<\/td><td>Moderate<\/td><\/tr><tr><td>Input style variety<\/td><td>\u2705 Photo\/cartoon\/3D\/anthro<\/td><td>Photo mainly<\/td><td>Photo mainly<\/td><td>Photo mainly<\/td><\/tr><tr><td>Min VRAM<\/td><td>10GB (TeaCache)<\/td><td>~6GB<\/td><td>~16GB<\/td><td>~8GB<\/td><\/tr><tr><td>Open-source<\/td><td>\u2705 Yes<\/td><td>\u2705 Yes<\/td><td>\u274c No<\/td><td>\u2705 Yes<\/td><\/tr><tr><td>Live demo available<\/td><td>\u2705 Yes<\/td><td>Limited<\/td><td>\u274c<\/td><td>\u274c<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>The gap is largest on multi-character support \u2014 no other open-source model in this space handles it at the architecture level.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bonus-go-further-with-gaga-ai\"><strong>Bonus: Go Further with Gaga AI<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar animates your portrait with audio. 
<a href=\"https:\/\/gaga.art\/en\/\">Gaga AI<\/a> takes that output and builds a full video production around it \u2014 adding generated backgrounds, original audio, a cloned voice, and TTS narration.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"623\" src=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/02\/gaga-ai-video-generation-1024x623.webp\" alt=\"gaga ai video generation\" class=\"wp-image-1426\" srcset=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/02\/gaga-ai-video-generation-1024x623.webp 1024w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/02\/gaga-ai-video-generation-300x183.webp 300w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/02\/gaga-ai-video-generation-768x467.webp 768w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/02\/gaga-ai-video-generation-1536x935.webp 1536w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/02\/gaga-ai-video-generation-2048x1246.webp 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>The four Gaga AI modules that pair most directly with a HunyuanVideo-Avatar workflow:<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"http:\/\/gaga.art\/app\" target=\"_blank\" rel=\"noreferrer noopener\">Generate Video Free<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/gaga.art\/\">Learn Gaga AI<\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"image-to-video-ai\" style=\"font-size:24px\"><strong>Image-to-Video AI<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p><a href=\"https:\/\/gaga.art\/en\/image-to-video-ai\">Gaga AI&#8217;s image-to-video 
engine<\/a> generates a cinematic motion sequence from a single still image \u2014 including AI-generated backgrounds to composite behind your HunyuanVideo-Avatar character.<\/p>\n\n\n\n<p>Take a background scene image \u2014 a studio, an outdoor environment, a branded space \u2014 and prompt Gaga AI to animate it: <em>&#8220;slow zoom into a softly lit studio interior&#8221;<\/em> or <em>&#8220;city skyline at sunset with subtle wind motion.&#8221;<\/em> The result is a moving background that wraps your avatar in a fully dynamic scene.<\/p>\n\n\n\n<p>Best for: Music video backdrops, branded content, social media video series, virtual studio setups.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"video-and-audio-infusion\" style=\"font-size:24px\"><strong>Video and Audio Infusion<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Gaga AI analyzes the combined video output from HunyuanVideo-Avatar and generates or binds synchronized audio \u2014 ambient sound, music, or environmental effects matched to the visual scene.<\/p>\n\n\n\n<p>After the avatar animation is rendered, Gaga AI&#8217;s audio infusion layer adds the surrounding audio context: the hum of a crowd, background music, or environmental sound. 
This transforms an audio-driven avatar clip into a complete audiovisual experience.<\/p>\n\n\n\n<p>Best for: Final production assembly, social video publishing, short-form content at scale.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"GAGA 1 PR Video\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/LlqfALVP-YI?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"ai-avatar-gaga-a-is-own-avatar-system\" style=\"font-size:24px\"><strong>AI Avatar (Gaga AI&#8217;s own avatar system)<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Gaga AI offers its own avatar generation system \u2014 complementary to HunyuanVideo-Avatar \u2014 for cases where you need a synthetic presenter generated from scratch rather than animated from a photo.<\/p>\n\n\n\n<p>Where HunyuanVideo-Avatar starts from a real portrait image and animates it, Gaga AI&#8217;s avatar system creates the presenter from text or minimal reference, then drives it with audio. 
The two tools cover different ends of the same workflow spectrum.<\/p>\n\n\n\n<p>Best for: Cases where no source portrait exists, AI influencer creation, scalable presenter content.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"ai-voice-clone-tts\" style=\"font-size:24px\"><strong>AI Voice Clone + TTS<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Gaga AI clones a target speaker&#8217;s voice from a short audio sample and generates narration, dialogue, or commentary in that voice from any text input.<\/p>\n\n\n\n<p>Workflow with HunyuanVideo-Avatar:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Record 30\u201360 seconds of your presenter&#8217;s voice<\/li>\n\n\n\n<li>Clone the voice in Gaga AI<\/li>\n\n\n\n<li>Write the script as text<\/li>\n\n\n\n<li>Generate narration in the cloned voice<\/li>\n\n\n\n<li>Feed that audio into HunyuanVideo-Avatar as the driving audio input<\/li>\n\n\n\n<li>The avatar&#8217;s lips and emotion sync to the cloned voice output<\/li>\n<\/ol>\n\n\n\n<p>This is the pipeline for creating fully synthetic presenter videos where the speaker never re-records after the initial voice sample \u2014 including for multilingual versions.<\/p>\n\n\n\n<p>Best for: Multilingual content production, AI spokesperson series, scalable narration at volume.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"the-complete-pipeline\" style=\"font-size:24px\"><strong>The Complete Pipeline<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Portrait Image + Script Text<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\u2193<\/p>\n\n\n\n<p>Gaga AI Voice Clone + TTS \u2500\u2500 Generate natural narration in cloned voice<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\u2193<\/p>\n\n\n\n<p>HunyuanVideo-Avatar \u2500\u2500 Animate portrait with AEM emotion + FAA multi-char<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\u2193<\/p>\n\n\n\n<p>Gaga AI Image-to-Video \u2500\u2500 Animate the background 
scene<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\u2193<\/p>\n\n\n\n<p>Gaga AI Audio Infusion \u2500\u2500 Add ambient audio, music, environmental sound<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\u2193<\/p>\n\n\n\n<p>Final Video \u2500\u2500 Studio-quality AI presenter content, no studio required<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"frequently-asked-questions\"><strong>Frequently Asked Questions<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-is-hunyuan-video-avatar-1\" style=\"font-size:24px\"><strong>What is HunyuanVideo-Avatar?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar is an open-source audio-driven human animation model from Tencent Hunyuan and Tencent Music Entertainment Lyra Lab. It takes a portrait image and an audio clip as input and generates a realistic talking avatar video with emotion control and multi-character support. The paper was submitted to arXiv on May 26, 2025 (arXiv:2505.20156).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-makes-hunyuan-video-avatar-different-from-other-talking-avatar-models\" style=\"font-size:24px\"><strong>What makes HunyuanVideo-Avatar different from other talking avatar models?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>HunyuanVideo-Avatar introduces three architectural innovations not found together in any other open-source model: (1) a Character Image Injection Module for stable character identity in dynamic video, (2) an Audio Emotion Module (AEM) for emotion transfer via a reference image, and (3) a Face-Aware Audio Adapter (FAA) that enables true multi-character animation with independent audio streams.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"does-hunyuan-video-avatar-support-cartoon-or-non-photorealistic-characters\" style=\"font-size:24px\"><strong>Does HunyuanVideo-Avatar support cartoon or non-photorealistic characters?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Yes. 
HunyuanVideo-Avatar supports photorealistic, cartoon\/illustrated, 3D-rendered, and anthropomorphic character styles at arbitrary input resolutions. It is not limited to realistic photographs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-gpu-do-i-need-to-run-hunyuan-video-avatar\" style=\"font-size:24px\"><strong>What GPU do I need to run HunyuanVideo-Avatar?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>The minimum is 24GB VRAM, though this is described as &#8220;very slow.&#8221; The recommended setup is 80\u201396GB VRAM for quality output. A 10GB VRAM option is available via the Wan2GP TeaCache integration, added June 6, 2025.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"is-hunyuan-video-avatar-free\" style=\"font-size:24px\"><strong>Is HunyuanVideo-Avatar free?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Yes. The inference code and model weights are open-source, released May 28, 2025. A live cloud demo is also available on the Tencent Hunyuan platform at no cost for testing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"does-hunyuan-video-avatar-work-on-windows\" style=\"font-size:24px\"><strong>Does HunyuanVideo-Avatar work on Windows?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>The officially tested operating system is Linux. Windows is not officially supported, so users attempting a Windows installation should expect to troubleshoot configuration issues on their own.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-is-the-audio-emotion-module-aem\" style=\"font-size:24px\"><strong>What is the Audio Emotion Module (AEM)?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>The AEM is a component of HunyuanVideo-Avatar that extracts emotional style from a user-provided emotion reference image and transfers it to the generated video. 
This allows precise control over the character&#8217;s facial expressions \u2014 independent of the emotional content of the driving audio \u2014 enabling fine-grained emotion direction.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-is-the-face-aware-audio-adapter-faa\" style=\"font-size:24px\"><strong>What is the Face-Aware Audio Adapter (FAA)?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>The FAA is the module that enables multi-character animation. It uses a latent-level face mask to isolate each character&#8217;s face region in the diffusion model&#8217;s latent space, then routes each character&#8217;s audio signal through a dedicated cross-attention mechanism. This prevents audio intended for one character from influencing another character&#8217;s facial motion.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"can-hunyuan-video-avatar-generate-full-body-video\" style=\"font-size:24px\"><strong>Can HunyuanVideo-Avatar generate full-body video?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Yes. 
The model supports portrait (face\/neck), upper-body, and full-body generation scales.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-do-i-cite-hunyuan-video-avatar-in-research\" style=\"font-size:24px\"><strong>How do I cite HunyuanVideo-Avatar in research?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@misc{chen2025hunyuanvideoavatarhighfidelityaudiodrivenhuman,\n  title={HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters},\n  author={Yi Chen and Sen Liang and Zixiang Zhou and Ziyao Huang and Yifeng Ma\n          and Junshu Tang and Qin Lin and Yuan Zhou and Qinglin Lu},\n  year={2025},\n  eprint={2505.20156},\n  archivePrefix={arXiv},\n  primaryClass={cs.CV},\n  url={https:\/\/arxiv.org\/abs\/2505.20156},\n}<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"is-comfy-ui-support-available-for-hunyuan-video-avatar\" style=\"font-size:24px\"><strong>Is ComfyUI support available for HunyuanVideo-Avatar?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>ComfyUI support is on the official open-source roadmap but has not been released yet. 
Check the GitHub repository for current status.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"official-resources\"><strong>Official Resources<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Resource<\/strong><\/td><td><strong>Link<\/strong><\/td><\/tr><tr><td>Project Page (video demos)<\/td><td><a href=\"https:\/\/hunyuanvideo-avatar.github.io\/\" rel=\"nofollow noopener\" target=\"_blank\">hunyuanvideo-avatar.github.io<\/a><\/td><\/tr><tr><td>GitHub Repository<\/td><td><a href=\"https:\/\/github.com\/Tencent-Hunyuan\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">Tencent-Hunyuan\/HunyuanVideo-Avatar<\/a><\/td><\/tr><tr><td>Hugging Face Model<\/td><td><a href=\"https:\/\/huggingface.co\/tencent\/HunyuanVideo-Avatar\" rel=\"nofollow noopener\" target=\"_blank\">tencent\/HunyuanVideo-Avatar<\/a><\/td><\/tr><tr><td>Live Cloud Demo<\/td><td><a href=\"https:\/\/hunyuan.tencent.com\/modelSquare\/home\/play?modelId=126\" rel=\"nofollow noopener\" target=\"_blank\">Tencent Hunyuan Platform<\/a><\/td><\/tr><tr><td>arXiv Paper<\/td><td><a href=\"https:\/\/arxiv.org\/abs\/2505.20156\" rel=\"nofollow noopener\" target=\"_blank\">arxiv.org\/abs\/2505.20156<\/a><\/td><\/tr><tr><td>Wan2GP (10GB VRAM support)<\/td><td><a href=\"https:\/\/huggingface.co\/spaces\/VIDraft\/Wan2GP\" rel=\"nofollow noopener\" target=\"_blank\">Wan2GP on Hugging Face<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>HunyuanVideo-Avatar by Tencent animates photos into talking avatars with emotion control &amp; multi-character support. Free, open-source. 
Full setup guide.<\/p>\n","protected":false},"author":2,"featured_media":1897,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[],"class_list":["post-1895","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-avatar"],"_links":{"self":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts\/1895","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/comments?post=1895"}],"version-history":[{"count":1,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts\/1895\/revisions"}],"predecessor-version":[{"id":1898,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts\/1895\/revisions\/1898"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/media\/1897"}],"wp:attachment":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/media?parent=1895"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/categories?post=1895"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/tags?post=1895"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}