Stability AI shipped Stable Audio 3 this week: four model variants spanning music and sound-effects generation, with the production-relevant numbers actually disclosed for once. The family: **small-music** (459M diffusion transformer + 108M SAME-S autoencoder, ~567M total, 2-min max, music only), **small-sfx** (same parameter count, SFX only), **medium** (1.4B DiT + 852M SAME-L, ~2.25B total, 6m20s max, both domains), and **large** (2.7B DiT + 852M SAME-L, ~3.55B total, 6m20s max, both domains). All variants output 44.1 kHz stereo. The architectural delta is the SAME autoencoder: a **4096× downsampling ratio** from two-stage compression (256× patching followed by 16× transformer resampling, and 256 × 16 = 4096), producing 256-dim latents at ~10.77 Hz for 44.1 kHz input. Prior audio autoencoders run at 1024-2048×, so Stable Audio 3's compression is 2-4× tighter, which is what makes the latency story possible.
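The compression arithmetic above is easy to check by hand. The following is a minimal sketch of it, not code from the release: the constants come from the figures quoted above, while the function names and the round-up of partial frames are assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 44_100   # output sample rate quoted in the release
PATCH_FACTOR = 256        # stage 1: patching
RESAMPLE_FACTOR = 16      # stage 2: transformer resampling


def downsampling_ratio() -> int:
    """Total ratio is the product of the two stages, not their sum."""
    return PATCH_FACTOR * RESAMPLE_FACTOR


def latent_rate_hz(sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Latent frames per second of audio."""
    return sample_rate_hz / downsampling_ratio()


def latent_frames(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Latent sequence length for a clip.

    Assumption: a trailing partial frame is padded up to a full one.
    The release does not say how the real model handles this.
    """
    samples = round(duration_s * sample_rate_hz)
    return math.ceil(samples / downsampling_ratio())


if __name__ == "__main__":
    print(f"ratio: {downsampling_ratio()}x")
    print(f"latent rate: {latent_rate_hz():.2f} Hz")
    for label, seconds in [("2 min", 120), ("6m20s", 380)]:
        print(f"{label}: {latent_frames(seconds)} latent frames")
```

Under these assumptions a 2-minute clip is about 1,292 latent frames and a 6m20s clip about 4,092, which is the sequence length the diffusion transformer has to process. At 1024× the same clips would need four times as many frames.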
The latency numbers on H200 are the headline: small-music generates 120 seconds of instrumental music in **0.45 seconds**. Medium takes 0.78s for 120s of music and 0.60s for 5s of SFX; large takes 0.81s and 0.64s for the same two tasks. All figures use eight-step ping-pong sampling with no classifier-free guidance. That puts small-music at ~267× faster than realtime (120s / 0.45s), which is interactive-workflow territory, not batch-only. Quality benchmarks: large hits **FAD 0.101** on 120s instrumental music (Fréchet Audio Distance, lower is better) and **CLAP 0.393** for text-audio alignment (higher is better), with a **4.30/5 musicality MOS** from a listener study (vs. medium's 4.15). On 5s SFX, large scores FAD 0.358 and CLAP 0.370. Editing capabilities: inpainting (single- or multi-region; medium reaches FAD 0.046 on single-region edits) and continuation via causal prefix masking. Outpainting is not in scope.
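To put the latency figures on a common scale, here is a small sketch that converts each disclosed (audio length, wall-clock time) pair into a realtime factor. The numbers are the ones quoted above; the table layout and function name are illustrative only.

```python
# (variant, task) -> (seconds of audio generated, wall-clock seconds on H200)
DISCLOSED = {
    ("small-music", "music"): (120.0, 0.45),
    ("medium", "music"): (120.0, 0.78),
    ("medium", "sfx"): (5.0, 0.60),
    ("large", "music"): (120.0, 0.81),
    ("large", "sfx"): (5.0, 0.64),
}


def realtime_factor(audio_s: float, wall_s: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_s / wall_s


FACTORS = {key: realtime_factor(*pair) for key, pair in DISCLOSED.items()}

if __name__ == "__main__":
    for (variant, task), factor in FACTORS.items():
        print(f"{variant:12s} {task:6s} {factor:7.1f}x realtime")
```

The SFX factors come out far lower (roughly 8× versus 150× or more for music) even though the wall-clock times are similar. That suggests a large share of the latency is a fixed cost that does not shrink with clip length, so the realtime multiple flatters long clips.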
Ecosystem read: this is the open-weights move that closes meaningful ground on the closed-source SOTA. Small and medium weights ship on Hugging Face under standard Stability licensing terms; the large variant is gated behind enterprise licensing. The release doesn't publish head-to-head comparisons against MusicGen, Suno, Udio, AudioLDM, or ElevenLabs Music, so readers should treat the FAD/CLAP/MOS numbers as Stability's self-reported scoring, not a competitive shootout. For builders deploying audio generation in product, the workflow story is the differentiator: 0.45s for 120s of music on H200 means a user-facing app can iterate audio in <1s per prompt without queueing. That's the latency threshold that turns audio-gen from "render at submit, wait, deliver" into "scrub a generation parameter, hear the change immediately." Repo: github.com/Stability-AI/stable-audio-3.
Monday morning: if you're building audio-gen into a product (game audio, podcast/video creator tools, accessibility, music apps), test the medium variant locally. It's the sweet spot of open weights, multi-domain coverage, and 6m20s duration. Inpainting at FAD 0.046 means you can offer "regenerate this 4-second section" UX without rebuilding the whole track. The large variant's enterprise gating is the catch: if your product needs the +0.15 musicality MOS improvement (4.30 vs. 4.15), plan for licensing conversations with Stability. Honest unaddressed gaps: no discussion of vocal generation (only instrumental music and SFX are mentioned), no training-data disclosure (copyright questions for commercial music outputs remain open), and no head-to-heads against Suno, Udio, or ElevenLabs Music (the obvious comps). The open weights at small/medium are the architectural-template release; production deployments need to do their own licensing audit before shipping commercially.
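For the "regenerate this 4-second section" UX, an app has to translate a time range picked by the user into latent frames. The sketch below shows one way to do that mapping at the ~10.77 Hz latent rate. It is an illustration under assumptions, not the release's inpainting API: both function names are hypothetical, and snapping the region outward to whole frames is a design choice made here.

```python
import math

SAMPLE_RATE_HZ = 44_100
DOWNSAMPLING = 4096  # 256x patching times 16x resampling, per the release


def region_to_latent_span(start_s: float, end_s: float, total_s: float) -> tuple[int, int, int]:
    """Map a time region to a half-open latent frame span [first, last).

    Hypothetical helper: the region is snapped outward so every frame that
    overlaps the selection is regenerated. Returns (first, last, n_frames).
    """
    if not 0 <= start_s < end_s <= total_s:
        raise ValueError("expected 0 <= start_s < end_s <= total_s")
    rate = SAMPLE_RATE_HZ / DOWNSAMPLING
    n_frames = math.ceil(total_s * rate)
    first = math.floor(start_s * rate)
    last = min(n_frames, math.ceil(end_s * rate))
    return first, last, n_frames


def inpaint_mask(start_s: float, end_s: float, total_s: float) -> list[bool]:
    """True marks latent frames to regenerate; False marks frames to keep."""
    first, last, n_frames = region_to_latent_span(start_s, end_s, total_s)
    return [first <= i < last for i in range(n_frames)]


if __name__ == "__main__":
    first, last, n = region_to_latent_span(30.0, 34.0, 120.0)
    print(f"regenerate frames {first}..{last - 1} of {n}")
```

With these assumptions, a 4-second edit at 0:30 in a 2-minute track touches 45 of 1,292 latent frames. Each frame covers about 93 ms of audio, so edit boundaries can only be placed at that granularity.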
