Cartesia was founded in 2023 by a team of researchers from Stanford, including Karan Goel, Albert Gu, and others who had been deeply involved in the development of state space models (SSMs). Albert Gu is widely credited as the architect of the S4 and Mamba architectures — the sequence modeling breakthroughs that demonstrated transformers were not the only viable path for deep learning on sequential data. Cartesia spun out of that research with a specific thesis: SSMs could deliver voice AI with fundamentally lower latency and better streaming characteristics than transformer-based approaches, and the time to commercialize that advantage was now.
The technical core of Cartesia's approach is genuinely different from that of most voice AI companies. While competitors like ElevenLabs and PlayHT build on transformer architectures (or hybrid systems that lean heavily on attention mechanisms), Cartesia's Sonic models are built natively on SSM architecture. The practical consequence is significant: SSMs process a sequence in time linear in its length, versus the quadratic scaling of standard attention. For voice generation specifically, this means Sonic can produce speech with end-to-end latency under 100 milliseconds, fast enough that in a conversational AI application the response feels instantaneous rather than "slightly delayed." This is not a marginal improvement; it is the difference between a voice assistant that feels like a phone call and one that feels like talking to a machine.
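To make the scaling argument concrete, here is a minimal, illustrative sketch of a single-channel linear state space recurrence. The scalar parameters `a`, `b`, `c` are placeholders for exposition, not Cartesia's model weights; real SSM layers use learned matrices, many channels, and parallel-scan implementations, but the asymptotic point is the same:

```python
# A (diagonal, single-channel) linear state space layer is a recurrence:
#   h[t] = a * h[t-1] + b * x[t]
#   y[t] = c * h[t]
# Each step touches a constant-size state, so T steps cost O(T),
# unlike self-attention, which scores every pair of timesteps
# and costs O(T^2).

def ssm_scan(x, a=0.9, b=0.5, c=1.0):
    """Run a toy single-channel linear SSM over sequence x in O(len(x))."""
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t      # constant-size state update
        out.append(c * h)        # readout
    return out

def attention_pairs(x):
    """Count the pairwise score computations naive self-attention performs."""
    return len(x) * len(x)       # O(T^2) query-key comparisons

seq = [1.0] * 1000
print(len(ssm_scan(seq)))        # 1000 state updates: linear in T
print(attention_pairs(seq))      # 1,000,000 comparisons: quadratic in T
```

For streaming audio the recurrence form matters as much as the asymptotics: the model carries its entire context in the fixed-size state `h`, so generating the next chunk never requires re-reading the whole history.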
Cartesia launched Sonic as their flagship model, and it quickly gained attention for both its speed and its quality. Sonic supports multiple languages, voice cloning from short samples, and fine-grained control over speaking style, pace, and emotion. Their API is designed for real-time applications — the kind of streaming, bidirectional voice interactions that agents and voice assistants need. In 2024, they released Sonic 2, which improved naturalness and expanded language support while maintaining the ultra-low latency that had become their signature. The company also offers an on-premises deployment option, which matters for healthcare, finance, and government customers who cannot send audio to third-party servers.
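The latency benefit of a streaming API can be illustrated without reference to Cartesia's real endpoints. The sketch below uses a made-up `synthesize_chunks` generator as a stand-in for any streaming TTS backend; the only point it demonstrates is that a consumer can begin playback after the first chunk arrives, rather than waiting for the whole utterance to be synthesized:

```python
# Hypothetical stand-in for a streaming TTS backend (not Cartesia's API).
# A streaming endpoint yields audio chunks as they are generated, so
# perceived latency is the cost of producing ONE chunk, not all of them.

def synthesize_chunks(text, chunk_bytes=320):
    """Yield fake 20 ms audio chunks (16-bit mono at 8 kHz), one per ~4 chars."""
    n_chunks = max(1, len(text) // 4)    # rough chars-per-chunk stand-in
    for _ in range(n_chunks):
        yield b"\x00" * chunk_bytes      # placeholder silence, not real speech

def time_to_first_audio(stream):
    """With streaming, playback can start as soon as the first chunk lands."""
    return next(iter(stream))

first = time_to_first_audio(synthesize_chunks("Hello there, how can I help?"))
print(len(first))   # one 20 ms chunk is enough to start playback
```

In a real deployment the same shape applies over a bidirectional transport such as a WebSocket: text goes in incrementally, audio chunks come back incrementally, and the conversational turn overlaps synthesis with playback.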
Cartesia raised $27 million in a Series A in 2024, with investors including Lightspeed Venture Partners and Index Ventures. For a company less than two years old at the time, that reflected the market's confidence in both the SSM approach and the team's pedigree. Their positioning is distinctive: while ElevenLabs competes primarily on voice quality and breadth, and Deepgram on transcription speed, Cartesia is staking out the "fastest real-time voice generation" claim and building everything around it. The bet is that as AI agents become the primary interface for software — replacing buttons and forms with conversation — the voice layer needs to be as fast as a human interlocutor, and SSMs are the architecture that gets you there.
Cartesia's existence is, in some ways, a referendum on whether architectural innovation still matters in an era dominated by scaling laws and data. Their answer is unequivocally yes. The same amount of compute that buys you a good transformer voice model buys you a faster, more efficient SSM voice model — and in real-time applications, that efficiency gap translates directly into user experience. Whether Cartesia remains an independent company or gets acquired for its technology, they have already proven that the SSM family of architectures has commercial legs well beyond the research lab.