Stable Audio 3: a four-model family, 0.45s for 120s of music on an H200, open weights

Stability AI shipped Stable Audio 3 this week: four model variants spanning music and sound-effects generation, and this time the production-relevant numbers are actually disclosed. The family: **small-music** (459M diffusion transformer + 108M SAME-S autoencoder, ~567M total, max 2 min, music only), **small-sfx** (same parameter count, SFX only), **medium** (1.4B DiT + 852M SAME-L, ~2.25B total, max 6m20s, both domains), and **large** (2.7B DiT + 852M SAME-L, ~3.55B total, max 6m20s, both domains). Output is 44.1 kHz stereo throughout. The architectural delta is the SAME autoencoder: a **4096× downsampling ratio** through two-stage compression (256× patching + 16× transformer resampling), producing 256-dim latents at ~10.76 Hz for 44.1 kHz input. Earlier audio autoencoders run at 1024-2048×, so Stable Audio's compression is 2-4× tighter, and that is what makes the latency story possible.
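The compression figures above can be checked with a few lines of arithmetic. The sketch below is illustrative and not taken from the Stable Audio 3 codebase: the function names are hypothetical, and rounding the frame count up is an assumption about how a partial final frame would be handled. It multiplies the two stage factors and derives the latent frame rate, which comes out at roughly 10.77 Hz (quoted above as ~10.76 Hz).

```python
import math

SAMPLE_RATE_HZ = 44_100   # output sample rate stated in the article
PATCH_FACTOR = 256        # stage 1: 256x patching
RESAMPLE_FACTOR = 16      # stage 2: 16x transformer resampling


def downsampling_ratio() -> int:
    """Total temporal compression of the two stages combined."""
    return PATCH_FACTOR * RESAMPLE_FACTOR


def latent_rate_hz(sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Latent frames per second of audio."""
    return sample_rate_hz / downsampling_ratio()


def latent_frames(duration_s: float) -> int:
    """Latent sequence length for a clip.

    Assumption: a partial final frame is padded, so we round up.
    """
    return math.ceil(duration_s * SAMPLE_RATE_HZ / downsampling_ratio())


if __name__ == "__main__":
    print("ratio:", downsampling_ratio())
    print("latent rate (Hz):", round(latent_rate_hz(), 4))
    print("frames for 120 s:", latent_frames(120))
    print("frames for 6m20s:", latent_frames(380))
```

Under these assumptions a 120s clip is about 1,292 latent frames and a 6m20s clip about 4,092, which gives a sense of how short the sequences are that the diffusion transformer has to process.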

The H200 latency numbers are the headline: small-music generates 120 seconds of instrumental music in **0.45 seconds**. Medium: 0.78s for 120s of music, 0.60s for 5s of SFX. Large: 0.81s for 120s of music, 0.64s for 5s of SFX. Sampling is eight-step ping-pong, with no classifier-free guidance. On small-music that is ~267× faster than real time: interactive-workflow territory, not batch-only. Quality benchmarks: on 120s instrumental music, large hits **FAD 0.101** (Fréchet Audio Distance, lower is better) and **CLAP 0.393** for text-audio alignment, with a **musicality MOS of 4.30/5** from a listener study (vs. 4.15 for medium). On 5s SFX, large scores FAD 0.358 and CLAP 0.370. Editing capabilities: inpainting (single- or multi-region; medium reaches FAD 0.046 on single-region edits) and continuation through causal prefix masking. Outpainting is not in scope.
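The "~267× faster than real time" figure is simply audio duration divided by wall-clock generation time. A minimal sketch that applies the same division to every latency quoted above follows; the table layout and function names are illustrative choices, and the inputs are the self-reported numbers from the text, not independent measurements.

```python
# (model, task) -> (seconds of audio generated, wall-clock seconds on an H200)
# Values are the figures quoted in the article.
REPORTED = {
    ("small-music", "120s music"): (120.0, 0.45),
    ("medium", "120s music"): (120.0, 0.78),
    ("medium", "5s SFX"): (5.0, 0.60),
    ("large", "120s music"): (120.0, 0.81),
    ("large", "5s SFX"): (5.0, 0.64),
}


def real_time_factor(audio_s: float, wall_s: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if wall_s <= 0:
        raise ValueError("wall-clock time must be positive")
    return audio_s / wall_s


def speedup_table() -> dict:
    """Real-time factor for every reported (model, task) pair."""
    return {key: real_time_factor(*value) for key, value in REPORTED.items()}


if __name__ == "__main__":
    for (model, task), rtf in speedup_table().items():
        print(f"{model:12s} {task:11s} {rtf:7.1f}x real time")
```

The same division puts the 5s SFX runs at only about 8× real time, so the headline multiplier applies to long music clips rather than to every workload.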

Ecosystem read: this is an open-weights move that closes meaningful ground on closed-source SOTA. Small and medium weights ship on HuggingFace under standard Stability licensing terms; the large variant is gated behind enterprise licensing. The release publishes no head-to-head comparisons against MusicGen, Suno, Udio, AudioLDM, or ElevenLabs Music, so readers should treat the FAD/CLAP/MOS numbers as Stability's self-reported scoring, not a competitive shootout. For builders deploying audio generation in a product, the workflow story is the differentiator: 0.45s for 120s of music on an H200 means a user-facing app can iterate on audio in <1s per prompt without queueing. That latency floor is what moves audio-gen from "render at submit, wait, deliver" to "scrub a generation parameter, hear the change immediately". Repo: github.com/Stability-AI/stable-audio-3.

Monday morning: if you are building audio-gen into a product (game audio, podcast/video creator tools, accessibility, music apps), test the medium variant locally; it is the sweet spot of open weights + multi-domain + 6m20s duration. Inpainting at FAD 0.046 means you can offer a "regenerate this 4-second section" UX without rebuilding the whole track. The catch is the large variant's enterprise gating: if your product needs the +0.15 musicality MOS improvement, plan for licensing conversations with Stability. Honest unaddressed gaps: no discussion of vocal generation (only instrumental + SFX are mentioned), no training-data disclosure (copyright questions stay open for commercial music outputs), no head-to-heads against Suno/Udio (the obvious comps), and no comparison with ElevenLabs Music. Open weights on small/medium make this an architectural-template release; production deployments will need to run their own licensing audit before shipping commercially.
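To make the "regenerate this 4-second section" idea concrete, here is a rough sketch of turning a time span chosen in a UI into a mask over latent frames. Everything in it is an assumption made for illustration: the helper names are hypothetical, the mask is assumed to live at the latent frame rate (44,100 / 4096 Hz), and the region is widened to whole frames. It does not reflect the actual inpainting interface in the Stable Audio 3 repo.

```python
import math

# Assumed latent frame rate: 44.1 kHz audio with 4096x downsampling.
LATENT_RATE_HZ = 44_100 / 4096


def region_to_latent_frames(start_s: float, end_s: float, total_s: float) -> tuple:
    """Map a [start_s, end_s) span to a half-open latent frame range.

    The range is widened outward so the whole requested span is covered.
    """
    if not 0 <= start_s < end_s <= total_s:
        raise ValueError("need 0 <= start_s < end_s <= total_s")
    total_frames = math.ceil(total_s * LATENT_RATE_HZ)
    first = math.floor(start_s * LATENT_RATE_HZ)
    last = min(math.ceil(end_s * LATENT_RATE_HZ), total_frames)
    return first, last


def build_inpaint_mask(start_s: float, end_s: float, total_s: float) -> list:
    """Hypothetical mask: True marks latent frames to regenerate."""
    total_frames = math.ceil(total_s * LATENT_RATE_HZ)
    first, last = region_to_latent_frames(start_s, end_s, total_s)
    return [first <= i < last for i in range(total_frames)]


if __name__ == "__main__":
    mask = build_inpaint_mask(30.0, 34.0, 120.0)
    print(len(mask), "frames total,", sum(mask), "marked for regeneration")
```

At this frame rate a 4-second edit touches only a few dozen latent frames out of roughly 1,300 in a 2-minute track, which is one plausible reason a localized edit can be cheap compared with a full regeneration.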
