Cartesia: Definition & Meaning — AI Wiki

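A voice AI startup built on a state space model (SSM) architecture rather than on transformers. Their Sonic model delivers ultra-low-latency speech generation, making real-time conversational AI feel natural for the first time.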

Why It Matters

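Cartesia matters because they showed that state space models are not just a research curiosity but a commercially viable architecture for real-time voice AI. Their sub-100-millisecond latency makes natural-feeling conversational AI possible for the first time, narrowing the gap between "talking to a bot" and "talking to a person." As the industry shifts toward voice-first AI agents, Cartesia's architectural advantage in streaming speed could make them the infrastructure layer that everyone else builds on.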

Deep Dive

Cartesia was founded in 2023 by a team of researchers from Stanford, including Karan Goel, Albert Gu, and others who had been deeply involved in the development of state space models (SSMs). Albert Gu is the lead author of S4 and a co-author of Mamba (with Tri Dao), the sequence modeling breakthroughs that demonstrated transformers were not the only viable path for deep learning on sequential data. Cartesia spun out of that research with a specific thesis: SSMs could deliver voice AI with fundamentally lower latency and better streaming characteristics than transformer-based approaches, and the advantage was ready to be commercialized.

The State Space Model Bet

The technical core of Cartesia's approach differs from that of most voice AI companies. While competitors like ElevenLabs and PlayHT build on transformer architectures (or hybrid systems that lean heavily on attention mechanisms), Cartesia's Sonic models are built natively on SSM architecture. The practical consequence is significant: SSMs process sequences in linear time relative to length, versus the quadratic scaling of standard attention. When run as a recurrence, an SSM also carries a fixed-size state from one step to the next, so the cost of producing the next output does not grow with how much context came before. For voice generation specifically, this means Sonic can produce speech with end-to-end latency under 100 milliseconds, fast enough that in a conversational AI application the response feels instantaneous rather than "slightly delayed." This is not a marginal improvement; it is the difference between a voice assistant that feels like a phone call and one that feels like talking to a machine.
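
To make the linear-time claim concrete, here is a minimal sketch of a discrete-time SSM run as a streaming recurrence. It is a toy illustration only and not Sonic's implementation: the class name DiagonalSSM, the helper functions, the diagonal state matrix, and all parameter values are assumptions chosen for readability. Each call to step() touches only the fixed-size state, so the per-sample cost stays constant however long the stream has been running. The quadratic function is a naive convolution used solely to check the recurrence; it is not an attention implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiagonalSSM:
    """Toy discrete-time SSM with a diagonal state matrix (hypothetical, illustrative only).

    Recurrence:  x_k = A x_{k-1} + B u_k,   y_k = C x_k
    """
    a: List[float]  # diagonal of A: one decay factor per state dimension
    b: List[float]  # input projection B
    c: List[float]  # output projection C
    state: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.state:
            self.state = [0.0] * len(self.a)

    def step(self, u: float) -> float:
        """Consume one input sample and emit one output sample.

        The cost is O(state size), independent of how many samples came before.
        """
        self.state = [
            a_i * x_i + b_i * u
            for a_i, x_i, b_i in zip(self.a, self.state, self.b)
        ]
        return sum(c_i * x_i for c_i, x_i in zip(self.c, self.state))


def run_streaming(model: DiagonalSSM, inputs: List[float]) -> List[float]:
    """Process a whole sequence one sample at a time: O(L) total work."""
    return [model.step(u) for u in inputs]


def run_by_convolution(
    a: List[float], b: List[float], c: List[float], inputs: List[float]
) -> List[float]:
    """Naive O(L^2) reference: y_k = sum_j K[j] * u[k - j], with K[j] = C A^j B."""
    length = len(inputs)
    kernel = [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]
    return [
        sum(kernel[j] * inputs[k - j] for j in range(k + 1))
        for k in range(length)
    ]


if __name__ == "__main__":
    a, b, c = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2]
    u = [1.0, 0.0, 2.0, -1.0]
    streamed = run_streaming(DiagonalSSM(a, b, c), u)
    reference = run_by_convolution(a, b, c, u)
    # Both forms compute the same outputs; only the cost profile differs.
    print(all(abs(s - r) < 1e-9 for s, r in zip(streamed, reference)))
```

The design point is that the streaming form never revisits earlier samples. Production SSMs such as S4 and Mamba use learned, more carefully parameterized state matrices and hardware-aware kernels, none of which this sketch attempts to reproduce.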

Sonic and the Product Suite

Cartesia launched Sonic as their flagship model, and it quickly gained attention for both its speed and its quality. Sonic supports multiple languages, voice cloning from short samples, and fine-grained control over speaking style, pace, and emotion. Their API is designed for real-time applications — the kind of streaming, bidirectional voice interactions that agents and voice assistants need. In 2024, they released Sonic 2, which improved naturalness and expanded language support while maintaining the ultra-low latency that had become their signature. The company also offers an on-premises deployment option, which matters for healthcare, finance, and government customers who cannot send audio to third-party servers.

Funding and Positioning

Cartesia raised $27 million in a Series A in 2024, with investors including Lightspeed Venture Partners and Index Ventures. For a company less than two years old at the time, that reflected the market's confidence in both the SSM approach and the team's pedigree. Their positioning is distinctive: while ElevenLabs competes primarily on voice quality and breadth, and Deepgram on transcription speed, Cartesia is staking out the "fastest real-time voice generation" claim and building everything around it. The bet is that as AI agents become the primary interface for software — replacing buttons and forms with conversation — the voice layer needs to be as fast as a human interlocutor, and SSMs are the architecture that gets you there.

Why Architecture Matters

Cartesia's existence is, in some ways, a referendum on whether architectural innovation still matters in an era dominated by scaling laws and data. Their answer is unequivocally yes. The same amount of compute that buys you a good transformer voice model buys you a faster, more efficient SSM voice model — and in real-time applications, that efficiency gap translates directly into user experience. Whether Cartesia remains an independent company or gets acquired for its technology, they have already proven that the SSM family of architectures has commercial legs well beyond the research lab.
