Cartesia released two models this week and put a bold label on both. Sonic-3.5, for text to speech, and Ink-2, for speech to text, are each billed as the number one streaming model for its task. Sonic-3.5 is the headline, a text-to-speech model the company calls the most natural streaming TTS by human preference, with an 82ms time-to-first-audio, new crosslingual voices, and support for personal voice clones. Ink-2 is the quieter half, a speech-to-text model with built-in turn detection, the feature that lets a system know when a speaker has actually finished talking.
The number-one claim deserves a caveat, and it is the kind worth stating plainly. The Artificial Analysis text-to-speech leaderboard that Cartesia's own announcement links to ranks Sonic-3.5 fourth overall, with an Elo of 1205, behind Fun-Realtime-TTS, Gemini 3.1 Flash TTS, and a research-preview model. So the crown is real only inside a narrower framing, such as fastest or best among production streaming models on a particular axis, and not as the independent top of the board. When a launch leads with a superlative the cited scoreboard does not support, the honest move is to read past the superlative.
Read past it and the release is still genuinely interesting, because the parts that hold up are the parts that matter for voice agents. An 82ms time-to-first-audio is low enough that a reply starts before a person registers a pause, and turn detection in the speech-to-text half is what keeps an agent from talking over someone or sitting in dead air. Put together, TTS, STT, and turn detection from a single vendor are the primitives of a full-duplex voice loop, the thing every company building a phone agent or a live assistant is currently stitching together from parts.
That is the real signal here, and it is a procurement signal more than a benchmark one. The voice-agent stack is consolidating: instead of gluing a TTS vendor to a separate STT vendor and a separate turn-detection heuristic, a builder can take the whole loop from one vendor, with the parts tuned to work together. Whether Sonic-3.5 is first or fourth on any given leaderboard matters less than whether the round trip feels instant and the model knows when to stop. On those terms the latency number is the one to watch, and the leaderboard rank is the one to take with a grain of salt.
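For anyone who wants to watch that latency number themselves, here is a minimal sketch of how time-to-first-audio could be measured against any streaming TTS client. It assumes only that the client yields audio as an iterable of byte chunks. The names time_to_first_audio_ms, FakeClock, and fake_stream are hypothetical and written for illustration, not taken from Cartesia's SDK, and the simulated stream stands in for a real network call.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_audio_ms(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Milliseconds from the call until the first non-empty chunk arrives.

    Returns None if the stream ends without producing any audio.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0
    return None


class FakeClock:
    """Hypothetical manual clock so the sketch runs without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def fake_stream(clock: FakeClock, delay_s: float, n_chunks: int = 3) -> Iterator[bytes]:
    """Simulated TTS stream: waits delay_s, then yields fixed-size chunks."""
    clock.advance(delay_s)  # stands in for model and network latency
    for _ in range(n_chunks):
        yield b"\x00" * 320
        clock.advance(0.02)


if __name__ == "__main__":
    clock = FakeClock()
    ttfa = time_to_first_audio_ms(fake_stream(clock, 0.082), clock)
    print(f"time to first audio: {ttfa:.1f} ms")
```

Against a real service, the default clock (time.perf_counter) applies and the timer should start when the request is sent, so the figure includes network round trip and will usually differ from a vendor's published number.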
