{"id":1789,"date":"2026-03-02T17:26:41","date_gmt":"2026-03-02T09:26:41","guid":{"rendered":"https:\/\/gaga.art\/blog\/?p=1789"},"modified":"2026-03-02T17:26:43","modified_gmt":"2026-03-02T09:26:43","slug":"fun-cosyvoice3-5-and-fun-audiogen-vd","status":"publish","type":"post","link":"https:\/\/gaga.art\/blog\/fun-cosyvoice3-5-and-fun-audiogen-vd\/","title":{"rendered":"Fun-CosyVoice3.5 &amp; Fun-AudioGen-VD: Alibaba&#8217;s AI Voice Revolution"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"523\" src=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/fun-cosyvoice3.5-and-fun-audiogen-vd-1024x523.webp\" alt=\"fun-cosyvoice3.5 and fun-audiogen-vd\" class=\"wp-image-1790\" srcset=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/fun-cosyvoice3.5-and-fun-audiogen-vd-1024x523.webp 1024w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/fun-cosyvoice3.5-and-fun-audiogen-vd-300x153.webp 300w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/fun-cosyvoice3.5-and-fun-audiogen-vd-768x393.webp 768w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2026\/03\/fun-cosyvoice3.5-and-fun-audiogen-vd.webp 1440w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"key-takeaways\"><strong>Key Takeaways<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Fun-CosyVoice3.5<\/strong> is Alibaba Tongyi&#8217;s upgraded multilingual voice cloning model with FreeStyle instruction control across 13 languages.<\/li>\n\n\n\n<li><strong>Fun-AudioGen-VD<\/strong> generates complete auditory scenes\u2014combining custom voice design with immersive environmental audio.<\/li>\n\n\n\n<li>Both models use natural language commands instead of rigid preset tags, making professional voice synthesis accessible without technical expertise.<\/li>\n\n\n\n<li>First-packet latency reduced by 
35%; rare character mispronunciation rate cut from 15.2% to 5.3%.<\/li>\n\n\n\n<li>Available via Alibaba Cloud&#8217;s DashScope API (cosyvoice-v3.5-plus and cosyvoice-v3.5-flash).<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-rank-math-toc-block has-custom-cd-994-c-color has-text-color has-link-color wp-elements-fe9c4702b5b06edadd4e97cb343ba356\" id=\"rank-math-toc\"><p>Table of Contents<\/p><nav><ul><li><a href=\"#key-takeaways\">Key Takeaways<\/a><\/li><li><a href=\"#what-are-fun-cosy-voice-3-5-and-fun-audio-gen-vd\">What Are Fun-CosyVoice3.5 and Fun-AudioGen-VD?<\/a><\/li><li><a href=\"#why-the-free-style-approach-changes-everything\">Why the &#8220;FreeStyle&#8221; Approach Changes Everything<\/a><\/li><li><a href=\"#fun-cosy-voice-3-5-deep-dive\">Fun-CosyVoice3.5: Deep Dive<\/a><ul><li><a href=\"#what-does-fun-cosy-voice-3-5-do\">What Does Fun-CosyVoice3.5 Do?<\/a><\/li><li><a href=\"#core-capabilities\">Core Capabilities<\/a><ul><li><a href=\"#free-style-instruct-tts\">FreeStyle Instruct-TTS<\/a><\/li><li><a href=\"#multilingual-support-now-13-languages\">Multilingual Support \u2014 Now 13 Languages<\/a><\/li><li><a href=\"#dramatically-improved-pronunciation-accuracy\">Dramatically Improved Pronunciation Accuracy<\/a><\/li><li><a href=\"#better-naturalness-via-reinforcement-learning\">Better Naturalness via Reinforcement Learning<\/a><\/li><\/ul><\/li><li><a href=\"#how-to-use-fun-cosy-voice-3-5-via-api\">How to Use Fun-CosyVoice3.5 via API<\/a><\/li><\/ul><\/li><li><a href=\"#fun-audio-gen-vd-deep-dive\">Fun-AudioGen-VD: Deep Dive<\/a><ul><li><a href=\"#what-does-fun-audio-gen-vd-do\">What Does Fun-AudioGen-VD Do?<\/a><\/li><li><a href=\"#controllable-voice-design\">Controllable Voice Design<\/a><\/li><li><a href=\"#immersive-scene-audio-generation\">Immersive Scene Audio Generation<\/a><\/li><li><a href=\"#primary-use-cases-for-fun-audio-gen-vd\">Primary Use Cases for Fun-AudioGen-VD<\/a><\/li><\/ul><\/li><li><a 
href=\"#fun-cosy-voice-3-5-vs-fun-audio-gen-vd-which-one-do-you-need\">Fun-CosyVoice3.5 vs. Fun-AudioGen-VD: Which One Do You Need?<\/a><\/li><li><a href=\"#bonus-gaga-ai-taking-ai-voice-into-ai-video\">BONUS: Gaga AI \u2014 Taking AI Voice Into AI Video<\/a><ul><li><a href=\"#what-is-gaga-ai\">What Is Gaga AI?<\/a><\/li><li><a href=\"#key-features\">Key Features<\/a><ul><li><a href=\"#image-to-video-ai\">Image to Video AI<\/a><\/li><li><a href=\"#video-and-audio-infusion-gaga-1-model\">Video and Audio Infusion (Gaga-1 Model)<\/a><\/li><li><a href=\"#ai-avatar\">AI Avatar<\/a><\/li><li><a href=\"#ai-voice-clone\">AI Voice Clone<\/a><\/li><li><a href=\"#text-to-speech-tts\">Text-to-Speech (TTS)<\/a><\/li><\/ul><\/li><li><a href=\"#why-gaga-ai-fun-cosy-voice-3-5-is-a-powerful-combination\">Why Gaga AI + Fun-CosyVoice3.5 Is a Powerful Combination<\/a><\/li><\/ul><\/li><li><a href=\"#faq-fun-cosy-voice-3-5-and-fun-audio-gen-vd\">FAQ: Fun-CosyVoice3.5 and Fun-AudioGen-VD<\/a><\/li><\/ul><\/nav><\/div>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-are-fun-cosy-voice-3-5-and-fun-audio-gen-vd\"><strong>What Are Fun-CosyVoice3.5 and Fun-AudioGen-VD?<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p><a href=\"https:\/\/help.aliyun.com\/zh\/model-studio\/cosyvoice-clone-design-api?spm=a2c4g.11186623.help-menu-search-2400256.d_2\" rel=\"nofollow noopener\" target=\"_blank\"><strong>Fun-CosyVoice3.5 and Fun-AudioGen-VD<\/strong><\/a><strong> are two AI speech models released by Alibaba&#8217;s Tongyi Lab on March 2, 2026<\/strong>, both built around a &#8220;FreeStyle&#8221; instruction paradigm that lets users control voice output through plain-text descriptions instead of fixed parameter menus.<\/p>\n\n\n\n<p>Traditional TTS systems force users to pick from dropdown menus\u2014preset emotions, rigid style tags, limited tone options. These two models break that pattern entirely. 
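<\/p>\n\n\n\n<p>As a rough sketch of that shift (the request field names below are illustrative only, not Alibaba&#8217;s actual API schema):<\/p>\n\n\n\n

```python
# Illustrative sketch of the control-model shift; these field
# names are hypothetical, not the real DashScope request schema.

# Traditional TTS: control is limited to a fixed menu of preset tags.
preset_request = {
    'text': 'Your order has shipped.',
    'emotion': 'happy',  # must be one of a short preset list
    'speed': 1.0,
}

# FreeStyle: delivery is described in plain language and is
# interpreted by the model.
freestyle_request = {
    'text': 'Your order has shipped.',
    'instruction': 'Lower the pitch slightly, slow the pace, '
                   'and add a hint of fatigue.',
}
```

\n\n\n\n<p>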
You describe what you want in everyday language, and the model delivers.<\/p>\n\n\n\n<p>They share the same core philosophy but serve different purposes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Fun-CosyVoice3.5<\/strong> \u2014 focused on voice cloning and expressive speech control<\/li>\n\n\n\n<li><strong>Fun-AudioGen-VD<\/strong> \u2014 focused on voice design and full-scene audio generation<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-the-free-style-approach-changes-everything\"><strong>Why the &#8220;FreeStyle&#8221; Approach Changes Everything<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>The fundamental problem with earlier voice synthesis tools was control rigidity. Users were constrained to a fixed set of emotion labels (&#8220;happy,&#8221; &#8220;sad,&#8221; &#8220;neutral&#8221;) with no way to express nuanced instructions like <em>&#8220;sound calm on the surface but slightly tense underneath.&#8221;<\/em><\/p>\n\n\n\n<p>FreeStyle removes that ceiling. Instead of selecting tags, you write instructions:<\/p>\n\n\n\n<p><em>&#8220;Lower the pitch slightly, slow the pace, and add a hint of fatigue.&#8221;<\/em><\/p>\n\n\n\n<p>The model interprets that sentence and renders it. This single shift moves voice generation from a <strong>configuration task<\/strong> into a <strong>creative task<\/strong>\u2014lowering the skill floor while raising the quality ceiling.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"fun-cosy-voice-3-5-deep-dive\"><strong>Fun-CosyVoice3.5: Deep Dive<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-does-fun-cosy-voice-3-5-do\" style=\"font-size:24px\"><strong>What Does Fun-CosyVoice3.5 Do?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Fun-CosyVoice3.5 is a multilingual voice cloning and expressive TTS model. 
It takes a reference audio sample (10\u201320 seconds is sufficient) and replicates that voice with high fidelity, then lets you steer delivery through natural language prompts.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"core-capabilities\" style=\"font-size:24px\"><strong>Core Capabilities<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"free-style-instruct-tts\"><strong>FreeStyle Instruct-TTS<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>You describe the tone and delivery in a single sentence. Examples from Alibaba&#8217;s documentation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>&#8220;Simulate a navigation assistant&#8217;s cheerful arrival message\u2014light tone, a sense of journey completed.&#8221;<\/em><\/li>\n\n\n\n<li><em>&#8220;Simulate a Cantonese news journalist asking a guest a question\u2014clear, steady, authoritative.&#8221;<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>The model handles both the voice replication and the expressive layering in one pass.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"multilingual-support-now-13-languages\"><strong>Multilingual Support \u2014 Now 13 Languages<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>Version 3.5 adds Thai, Indonesian, Portuguese, and Vietnamese to the existing lineup. 
Full language support now covers:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Chinese (Mandarin + 16 regional dialects including Cantonese, Shanghainese, Sichuan)<\/li>\n\n\n\n<li>English, French, German, Japanese, Korean, Russian<\/li>\n\n\n\n<li>Portuguese, Thai, Indonesian, Vietnamese<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Across all 13 languages, Alibaba claims industry-leading scores on Word Error Rate (WER) and Speaker Similarity (SpkSim) benchmarks.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"dramatically-improved-pronunciation-accuracy\"><strong>Dramatically Improved Pronunciation Accuracy<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>The model was specifically optimized for rare characters, classical Chinese text, and complex sentence structures. The result:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Metric<\/strong><\/td><td><strong>Before<\/strong><\/td><td><strong>After<\/strong><\/td><\/tr><tr><td>Rare character error rate<\/td><td>15.2%<\/td><td>5.3%<\/td><\/tr><tr><td>Long-form stability<\/td><td>Inconsistent<\/td><td>Significantly improved<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>This matters for content creators reading academic papers, legal documents, classical literature, or technical manuals.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"better-naturalness-via-reinforcement-learning\"><strong>Better Naturalness via Reinforcement Learning<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>Tongyi Lab used two RL-based fine-tuning methods:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DiffRO + GRPO<\/strong> on the language model layer \u2014 improves rhythm and prosody with multi-channel duration rewards<\/li>\n\n\n\n<li><strong>Flow-GRPO<\/strong> on the audio generation layer \u2014 improves voice similarity and audio quality<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>The result is speech that sounds more layered and human, rather than flat or 
robotic.<\/p>\n\n\n\n<p><strong>Performance Improvements<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Metric<\/strong><\/td><td><strong>Improvement<\/strong><\/td><\/tr><tr><td>Tokenizer frame rate<\/td><td>Halved<\/td><\/tr><tr><td>First-packet latency<\/td><td>Reduced by 35%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>These changes matter for real-time applications\u2014live streaming, customer service bots, interactive voice agents\u2014where delays break the experience.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-to-use-fun-cosy-voice-3-5-via-api\" style=\"font-size:24px\"><strong>How to Use Fun-CosyVoice3.5 via API<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>The model is available through Alibaba Cloud&#8217;s DashScope SDK. Here&#8217;s a minimal Python example to clone a voice:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-vivid-green-cyan-color has-text-color has-link-color has-fixed-layout\"><tbody><tr><td>from dashscope.audio.tts_v2 import VoiceEnrollmentService<br>service = VoiceEnrollmentService()<br>voice_id = service.create_voice(<br>&nbsp;&nbsp;&nbsp;&nbsp;target_model='cosyvoice-v3.5-plus',<br>&nbsp;&nbsp;&nbsp;&nbsp;prefix='myvoice',<br>&nbsp;&nbsp;&nbsp;&nbsp;url='https:\/\/your-audio-file-url',<br>&nbsp;&nbsp;&nbsp;&nbsp;language_hints=['zh'],&nbsp;&nbsp;# or 'en', 'pt', 'th', 'id', 'vi', etc.<br>)<br>print(f'Voice ID: {voice_id}')<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Key parameters to know:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>target_model \u2014 must match the model used in your synthesis call later<\/li>\n\n\n\n<li>prefix \u2014 alphanumeric label (max 10 characters) for your voice ID<\/li>\n\n\n\n<li>url \u2014 public URL to your reference audio (10\u201320 seconds, clear, minimal 
noise)<\/li>\n\n\n\n<li>language_hints \u2014 helps the model identify the source audio language for better cloning<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Voice quota:<\/strong> Up to 1,000 custom voices per account. Voices unused for 12 months are auto-deleted. Creating and managing voices is free; synthesis is billed per character.<\/p>\n\n\n\n<p><strong>Common troubleshooting tips:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use WAV over MP3 for source audio (avoids lossy compression artifacts)<\/li>\n\n\n\n<li>Keep speech continuous \u2014 avoid gaps longer than 2 seconds<\/li>\n\n\n\n<li>Ensure at least 60% of the audio clip is active speech<\/li>\n\n\n\n<li>Recommended sample rate: 16kHz or higher, mono channel<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"fun-audio-gen-vd-deep-dive\"><strong>Fun-AudioGen-VD: Deep Dive<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-does-fun-audio-gen-vd-do\" style=\"font-size:24px\"><strong>What Does Fun-AudioGen-VD Do?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Fun-AudioGen-VD is Alibaba&#8217;s scene-based audio generation model. 
Where Fun-CosyVoice3.5 clones and refines existing voices, Fun-AudioGen-VD <strong>creates voices from scratch<\/strong> based on text descriptions\u2014and wraps them in fully designed acoustic environments.<\/p>\n\n\n\n<p>Think of it as the difference between a voice actor (CosyVoice3.5) and a full production studio (AudioGen-VD).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"controllable-voice-design\" style=\"font-size:24px\"><strong>Controllable Voice Design<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>You can specify every dimension of a voice without recording a single second of audio:<\/p>\n\n\n\n<p><strong>Basic attributes:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Gender, age, accent, pitch, speech rate<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Timbral qualities:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Husky, bright, deep, magnetic, breathy<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Emotional states:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Anger, sadness, excitement, determination, anxiety<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Role simulation:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Customer service agent, military veteran, child, AI assistant, news broadcaster<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Complex psychological states:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>&#8220;Calm on the surface but trembling inside&#8221;<\/em><\/li>\n\n\n\n<li><em>&#8220;Confident but hiding exhaustion&#8221;<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Example instruction used by Tongyi Lab:<\/p>\n\n\n\n<p><em>&#8220;Character: deranged villain. Acoustic style: sinister and erratic. Voice: shrill. 
Requirement: pitch spikes mid-sentence unpredictably, with irregular swallowing sounds and dismissive laughter, full of arrogance and psychological distortion.&#8221;<\/em><\/p>\n\n\n\n<p>The model generates a voice that fits that description without any reference audio needed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"immersive-scene-audio-generation\" style=\"font-size:24px\"><strong>Immersive Scene Audio Generation<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Fun-AudioGen-VD doesn&#8217;t stop at voice. It builds the sonic environment around it:<\/p>\n\n\n\n<p><strong>Background environments:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Urban street noise, caf\u00e9 ambiance, battlefield explosions, forest sounds<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Spatial reverb effects:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cathedral acoustics, metal prison cells, underwater echo, small room reverb<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Device-style filters:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Vintage radio crackle, walkie-talkie compression, breathing mask muffling<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Dynamic environmental interactions:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Wind noise that fluctuates, echoes that shift with distance, progressive hoarseness<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Example instruction:<\/p>\n\n\n\n<p><em>&#8220;Scene: a busy caf\u00e9. Background: coffee grinder hum, clink of ceramic cups, distant murmur of conversations. 
Speaker tone: relaxed, like chatting over afternoon tea.&#8221;<\/em><\/p>\n\n\n\n<p>The output isn&#8217;t just the voice\u2014it&#8217;s the entire acoustic scene baked in.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"primary-use-cases-for-fun-audio-gen-vd\" style=\"font-size:24px\"><strong>Primary Use Cases for Fun-AudioGen-VD<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Game development<\/strong> \u2014 Generate NPC voices and ambient audio from text descriptions, no recording studio needed<\/li>\n\n\n\n<li><strong>Film and animation<\/strong> \u2014 Rapidly prototype character voices and scene audio before final production<\/li>\n\n\n\n<li><strong>Audiobooks and podcasts<\/strong> \u2014 Create unique voice identities for different characters without hiring multiple voice actors<\/li>\n\n\n\n<li><strong>Advertising<\/strong> \u2014 Design brand voices from scratch with precise timbral and emotional specifications<\/li>\n\n\n\n<li><strong>Training data generation<\/strong> \u2014 Produce high-quality reference audio for other voice cloning pipelines<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"fun-cosy-voice-3-5-vs-fun-audio-gen-vd-which-one-do-you-need\"><strong>Fun-CosyVoice3.5 vs. 
Fun-AudioGen-VD: Which One Do You Need?<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Need<\/strong><\/td><td><strong>Use This<\/strong><\/td><\/tr><tr><td>Clone a real person&#8217;s voice<\/td><td>Fun-CosyVoice3.5<\/td><\/tr><tr><td>Control how an existing voice is delivered<\/td><td>Fun-CosyVoice3.5<\/td><\/tr><tr><td>Create a completely new voice from a description<\/td><td>Fun-AudioGen-VD<\/td><\/tr><tr><td>Generate a voice + environmental audio together<\/td><td>Fun-AudioGen-VD<\/td><\/tr><tr><td>Multilingual content production<\/td><td>Fun-CosyVoice3.5<\/td><\/tr><tr><td>Game\/film character audio<\/td><td>Fun-AudioGen-VD<\/td><\/tr><tr><td>Real-time applications (low latency required)<\/td><td>Fun-CosyVoice3.5<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The models are designed to complement each other. Fun-AudioGen-VD can generate high-quality reference audio that Fun-CosyVoice3.5 can then clone and deploy at scale.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bonus-gaga-ai-taking-ai-voice-into-ai-video\"><strong>BONUS: Gaga AI \u2014 Taking AI Voice Into AI Video<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p>If Fun-CosyVoice3.5 and Fun-AudioGen-VD handle the audio layer, <a href=\"https:\/\/gaga.art\/en\/\"><strong>Gaga AI<\/strong><\/a> tackles the full multimedia production stack\u2014combining AI-generated video, voice cloning, and avatar creation into one platform.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"653\" src=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2025\/09\/gaga-ai-video-generator-studio.webp\" alt=\"gaga ai video generator studio\" class=\"wp-image-206\" srcset=\"https:\/\/gaga.art\/blog\/wp-content\/uploads\/2025\/09\/gaga-ai-video-generator-studio.webp 1000w, 
https:\/\/gaga.art\/blog\/wp-content\/uploads\/2025\/09\/gaga-ai-video-generator-studio-300x196.webp 300w, https:\/\/gaga.art\/blog\/wp-content\/uploads\/2025\/09\/gaga-ai-video-generator-studio-768x502.webp 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-is-gaga-ai\" style=\"font-size:24px\"><strong>What Is Gaga AI?<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Gaga AI is an AI-powered content creation platform built around the <a href=\"https:\/\/gaga.art\/en\/gaga-1\"><strong>Gaga-1 model<\/strong><\/a>, which fuses video generation with synchronized audio\u2014voice, music, and ambient sound\u2014in a single generation pass.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"http:\/\/gaga.art\/app\" target=\"_blank\" rel=\"noreferrer noopener\">Generate Video Free<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/gaga.art\/\">Learn Gaga AI<\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"key-features\" style=\"font-size:24px\"><strong>Key Features<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading has-vivid-red-color has-text-color has-link-color wp-elements-46438f76300271423f8275d22d2a6cf6\" id=\"image-to-video-ai\"><strong>Image to Video AI<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>Upload a static image and Gaga-1 animates it into a coherent video clip. 
The model understands scene context, lighting, and subject motion, producing smooth, realistic output without manual keyframing.<\/p>\n\n\n\n<h4 class=\"wp-block-heading has-vivid-red-color has-text-color has-link-color wp-elements-8f7c6505bf04e42850d915f9e4754914\" id=\"video-and-audio-infusion-gaga-1-model\"><strong>Video and Audio Infusion (Gaga-1 Model)<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>The Gaga-1 model&#8217;s core innovation is the simultaneous generation of video and its acoustic environment. Rather than generating silent video and adding audio in post-production, Gaga-1 produces both in sync\u2014dialogue, background noise, and sound effects all aligned to the visual action.<\/p>\n\n\n\n<h4 class=\"wp-block-heading has-vivid-red-color has-text-color has-link-color wp-elements-22ba35bb3b78452e516623eee4f2dcce\" id=\"ai-avatar\"><strong>AI Avatar<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>Create a photorealistic or stylized digital avatar that speaks, moves, and emotes. Useful for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Corporate training videos without on-camera talent<\/li>\n\n\n\n<li>Multilingual content (swap voice and lip-sync language)<\/li>\n\n\n\n<li>Brand mascots and virtual presenters<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading has-vivid-red-color has-text-color has-link-color wp-elements-5362b428ed82ea220fe81b4a48854075\" id=\"ai-voice-clone\"><strong>AI Voice Clone<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>Gaga AI includes a voice cloning layer that works alongside (or independently from) its video generation pipeline. 
Record a short sample, and the platform replicates that voice for use across all generated content\u2014consistent brand voice at scale.<\/p>\n\n\n\n<h4 class=\"wp-block-heading has-vivid-red-color has-text-color has-link-color wp-elements-edcbbb0d8f87d7b5545f270bb64bec78\" id=\"text-to-speech-tts\"><strong>Text-to-Speech (TTS)<\/strong><\/h4>\n\n\n\n<p><\/p>\n\n\n\n<p>A built-in TTS engine handles script-to-voice generation for avatars and video narration, with style and emotion controls that mirror the FreeStyle paradigm seen in Alibaba&#8217;s models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"why-gaga-ai-fun-cosy-voice-3-5-is-a-powerful-combination\" style=\"font-size:24px\"><strong>Why Gaga AI + Fun-CosyVoice3.5 Is a Powerful Combination<\/strong><\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>Use Fun-CosyVoice3.5 or Fun-AudioGen-VD to design or clone your ideal voice with precision. Export that audio and feed it into Gaga AI&#8217;s video pipeline to create avatar-driven video content with that exact voice, fully synced and animated.<\/p>\n\n\n\n<p>This workflow bridges the gap between audio perfection and visual production\u2014giving creators a complete, AI-driven content pipeline from script to finished video.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"faq-fun-cosy-voice-3-5-and-fun-audio-gen-vd\"><strong>FAQ: Fun-CosyVoice3.5 and Fun-AudioGen-VD<\/strong><\/h2>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-1e63cf2364e434c9d427d84cabbf7353\"><strong>What is Fun-CosyVoice3.5?<\/strong><\/p>\n\n\n\n<p>Fun-CosyVoice3.5 is a multilingual voice cloning and expressive TTS model from Alibaba&#8217;s Tongyi Lab. 
It supports 13 languages and allows users to control speech delivery using plain-text instructions rather than preset tags.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-e49ea0488fed1d5a5ade34eda9c61d37\"><strong>What is Fun-AudioGen-VD?<\/strong><\/p>\n\n\n\n<p>Fun-AudioGen-VD is Alibaba&#8217;s scene-based audio generation model. It creates custom voices from text descriptions and generates full acoustic environments\u2014background noise, reverb, device filters\u2014alongside the voice.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-fa9ecf8108f3e221ee993eee68748596\"><strong>How is FreeStyle instruction control different from standard TTS?<\/strong><\/p>\n\n\n\n<p>Standard TTS uses fixed labels like &#8220;happy&#8221; or &#8220;neutral.&#8221; FreeStyle lets you write any natural language description\u2014&#8221;sound tired but trying to hide it&#8221;\u2014and the model interprets and renders it. No preset menu required.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-9863d56568d86ef175da83b53404c794\"><strong>What languages does Fun-CosyVoice3.5 support?<\/strong><\/p>\n\n\n\n<p>13 languages: Chinese (Mandarin + 16 dialects), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-dfeeeb3e3553d1267979a9c871d20413\"><strong>How much audio do I need to clone a voice with Fun-CosyVoice3.5?<\/strong><\/p>\n\n\n\n<p>10 to 20 seconds of clear audio is sufficient. Longer isn&#8217;t necessarily better\u2014quality matters more than duration.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-c30047fd6e3d5e371a05b049cb4dbc95\"><strong>Can Fun-AudioGen-VD create voices without a reference recording?<\/strong><\/p>\n\n\n\n<p>Yes. That&#8217;s its primary use case. 
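<\/p>\n\n\n\n<p>For instance, a design prompt can be assembled from structured attributes, following the &#8220;Character \/ Acoustic style \/ Voice \/ Requirement&#8221; pattern in Tongyi Lab&#8217;s example (the attribute values below are invented for illustration):<\/p>\n\n\n\n

```python
# Assembling a Fun-AudioGen-VD-style voice-design prompt as plain
# text. The section labels mirror Tongyi Lab's published example;
# the attribute values are invented for illustration.
attributes = {
    'Character': 'weary night-shift radio host',
    'Acoustic style': 'warm and low-key',
    'Voice': 'deep, slightly husky',
    'Requirement': 'slow pace with long pauses, confident '
                   'but hiding exhaustion',
}
prompt = ' '.join(f'{label}: {value}.' for label, value in attributes.items())
print(prompt)
```

\n\n\n\n<p>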
You describe the voice you want in text\u2014age, gender, accent, emotion, timbre\u2014and the model generates it from scratch.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-7df492d9509443197574836796555a40\"><strong>Is Fun-CosyVoice3.5 available outside China?<\/strong><\/p>\n\n\n\n<p>The cosyvoice-v3.5-plus and cosyvoice-v3.5-flash models are currently only available in Alibaba Cloud&#8217;s China mainland deployment (Beijing region). For international regions (Singapore), use cosyvoice-v3-plus or cosyvoice-v3-flash.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-c7fd10af5da865707c7bf68fd4bc443d\"><strong>How do I access these models?<\/strong><\/p>\n\n\n\n<p>Through Alibaba Cloud&#8217;s DashScope API and SDK. Documentation is available at https:\/\/help.aliyun.com\/zh\/model-studio\/text-to-speech and the cloning API reference at https:\/\/help.aliyun.com\/zh\/model-studio\/cosyvoice-clone-api.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-8e4eb1e617d03d12f64bff4b8c4d5e1f\"><strong>What&#8217;s the cost structure?<\/strong><\/p>\n\n\n\n<p>Creating, querying, updating, and deleting custom voices is free. Speech synthesis using cloned voices is billed per character of text synthesized.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-8147a16774c4b5557b75a6863d858852\"><strong>What audio quality does the source recording need to be?<\/strong><\/p>\n\n\n\n<p>Recommended: WAV format, 16kHz+ sample rate, mono channel, no background noise, no gaps longer than 2 seconds, at least 60% active speech content.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-306c8e8d6d0bfc5bf8b6967b4e3c1e2f\"><strong>Can I use Fun-AudioGen-VD for game audio?<\/strong><\/p>\n\n\n\n<p>Yes\u2014it&#8217;s one of the primary intended use cases. 
You can generate character voices, ambient soundscapes, and environmental audio from text descriptions, significantly reducing production time and recording costs.<\/p>\n\n\n\n<p class=\"has-vivid-red-color has-text-color has-link-color wp-elements-c1f50d628e3c334d84d940ce90cd1fce\"><strong>What&#8217;s the voice quota per account?<\/strong><\/p>\n\n\n\n<p>1,000 custom voices maximum. Voices that go unused for 12 months are automatically deleted to free quota.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Alibaba&#8217;s Fun-CosyVoice3.5 and Fun-AudioGen-VD redefine AI voice generation\u2014clone voices, design soundscapes &amp; control speech with plain text. Here&#8217;s everything you need to know.<\/p>\n","protected":false},"author":2,"featured_media":1790,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1789","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-audio"],"_links":{"self":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts\/1789","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/comments?post=1789"}],"version-history":[{"count":1,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts\/1789\/revisions"}],"predecessor-version":[{"id":1791,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/posts\/1789\/revisions\/1791"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/media\/1790"}],"wp:attachment":[{"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/media?parent=1789"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https
:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/categories?post=1789"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gaga.art\/blog\/wp-json\/wp\/v2\/tags?post=1789"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}