Mistral released Voxtral-4B-TTS on March 26, claiming it beats ElevenLabs v2.5 Flash in human evaluations with a 62.8% preference rate. The 4-billion-parameter model runs on 3GB of VRAM, supports 9 languages, and promises zero-shot voice cloning from 3-second audio samples. But there's a catch: Mistral removed the audio autoencoder weights from the open release, meaning developers can only use Mistral's 20 preset voices, not clone arbitrary voices locally.
This is classic AI company behavior: promise open source, deliver something neutered. The technical achievement is real: Voxtral uses an autoregressive LLM backbone (Ministral 3B) that generates 80ms audio tokens, with a sophisticated head combining semantic and acoustic components. The quality appears legitimate based on independent testing. But without the full encoder, "open weights" becomes marketing speak for "demo version."
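To put the 80ms token figure in perspective, here's a quick back-of-envelope calculation (a sketch using only the per-token duration quoted above; the clip length is an illustrative assumption, not from Mistral's documentation):

```python
# Rough token-rate arithmetic for an 80ms-per-token audio codec.
TOKEN_DURATION_S = 0.080  # 80ms per audio token, per the article

# Sustained generation rate needed for real-time synthesis:
tokens_per_second = 1 / TOKEN_DURATION_S  # 12.5 tokens/s

# Tokens the model must emit for a hypothetical 10-second clip:
clip_seconds = 10
tokens_for_clip = clip_seconds / TOKEN_DURATION_S  # 125 tokens

print(f"{tokens_per_second} tokens/s, {tokens_for_clip:.0f} tokens per {clip_seconds}s clip")
```

At 12.5 tokens per second, the autoregressive backbone has a far lower sequence-length burden than raw waveform models, which is part of why a 4B model fits in 3GB of VRAM.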
The broader ecosystem is already working around Mistral's limitations. Course creators are building training around the API-only voice cloning at $0.016 per thousand characters versus ElevenLabs' $22/month subscription. The CC-BY-NC license blocks commercial self-hosting anyway, pushing serious users toward Mistral's paid API regardless. Some researchers are investigating whether audio representations can be reconstructed without the missing encoder weights, though success remains unclear.
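Taking the quoted prices at face value, a quick break-even calculation shows where the API pricing beats a flat subscription (illustrative only; it ignores plan quotas, overage rates, and free tiers):

```python
# Break-even volume between Mistral's per-character API pricing and a
# flat monthly subscription, using the figures quoted in the article.
MISTRAL_USD_PER_1K_CHARS = 0.016  # Mistral API: $0.016 per thousand characters
ELEVENLABS_USD_PER_MONTH = 22.00  # ElevenLabs: $22/month subscription

# Characters per month at which both options cost the same:
break_even_chars = ELEVENLABS_USD_PER_MONTH / MISTRAL_USD_PER_1K_CHARS * 1000

print(f"Break-even: {break_even_chars:,.0f} characters/month")
```

Below roughly 1.4 million characters a month, pay-per-use comes out cheaper, which explains why low-volume course creators are gravitating toward the API.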
For developers, this represents the current state of "open" AI: impressive capabilities with strategic limitations that funnel users toward paid services. Voxtral's quality and efficiency are noteworthy, especially for multilingual applications, but the voice-cloning handicap makes it less compelling than initially promised. Unless you're fine with preset voices or with paying API fees, ElevenLabs remains the better choice for custom voice work.
