Sakana released KAME, a tandem speech-to-speech architecture that resolves the trade-off builders running voice agents have been stuck with: cascaded pipelines (STT → LLM → TTS) hit 2.1-second median first-response latency but carry full LLM knowledge depth, while pure end-to-end S2S like Moshi runs at ~80ms token cycles but loses that depth. KAME pairs both: a Moshi-class S2S front-end, a streaming STT+LLM back-end running asynchronously, and a fourth signal channel they call the "oracle stream" feeding LLM predictions into the S2S generator while it's already producing audio. Weights, paper, and inference code are public on Hugging Face and GitHub.
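The latency gap alone frames the stakes; a quick check of the ratio, using only the figures quoted above:

```python
# Rough latency arithmetic from the figures above (illustrative only).
CASCADED_MEDIAN_FIRST_RESPONSE_S = 2.1  # serialized STT -> LLM -> TTS
S2S_TOKEN_CYCLE_S = 0.080               # Moshi-class frame cadence

# KAME's claim: first audio still lands on the S2S cadence, because the
# back-end LLM refines content *after* audio generation has started.
ratio = CASCADED_MEDIAN_FIRST_RESPONSE_S / S2S_TOKEN_CYCLE_S
print(f"first-response latency gap: ~{ratio:.0f}x")  # ~26x
```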
The mechanism is the interesting part. Moshi's original design co-models three streams in a single transformer: input audio, inner-monologue text, and output audio. KAME adds a fourth: oracle tokens generated by the back-end LLM as the user's transcript progressively completes. The back-end isn't waiting for utterance end; it runs predictions on the partial transcript and refines them as more audio comes in. Those oracle tokens stream into the S2S model, which conditions ongoing audio generation on both its internal context and the incoming oracle. The result: first-token latency stays at Moshi's ~80ms, while the response content carries the back-end LLM's knowledge depth. The back-end is decoupled enough that the S2S can keep generating ambient acoustic continuity while the LLM is still thinking, the "speaking while thinking" framing in the paper. Training used 56,582 synthetic dialogues converted to audio from MMLU-Pro, GSM8K, and HSSBench text, with evaluation on MT-Bench reasoning/STEM/humanities (coding, extraction, and math were excluded as unsuitable for speech tasks).
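In control-flow terms, the pattern is two loops that never block each other. A minimal sketch of that flow (every name here is a hypothetical stand-in, not KAME's published API):

```python
import asyncio
from asyncio import Queue

def llm_predict(partial_transcript: str) -> str:
    """Stand-in for the full-depth back-end LLM refining a partial transcript."""
    return f"oracle({partial_transcript!r})"

async def back_end_oracle(transcript_chunks: list, oracle_queue: Queue) -> None:
    """Re-run the LLM as each STT chunk arrives; push refined oracle tokens."""
    partial = ""
    for chunk in transcript_chunks:
        await asyncio.sleep(0.3)                      # simulated STT chunk cadence
        partial += chunk
        await oracle_queue.put(llm_predict(partial))  # a refinement, not a final answer

async def front_end_s2s(oracle_queue: Queue, frames: int = 20) -> None:
    """Emit an 'audio frame' every ~80ms, conditioned on the freshest oracle."""
    latest_oracle = None
    for i in range(frames):
        await asyncio.sleep(0.08)                     # Moshi-class token cycle
        while not oracle_queue.empty():               # drain to the newest refinement
            latest_oracle = oracle_queue.get_nowait()
        # Speaking while thinking: frames keep flowing even while
        # latest_oracle is still None or stale.
        print(f"frame {i:02d} conditioned on: {latest_oracle}")

async def main() -> None:
    q: Queue = Queue()
    await asyncio.gather(
        back_end_oracle(["What is", " the capital", " of France?"], q),
        front_end_s2s(q),
    )

asyncio.run(main())
```

The first few frames print `None`, which is the decoupling in miniature: audio is already underway before the back-end has said anything.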
The ecosystem read is that Sakana is closing a real gap in the voice-agent stack. Cascaded systems dominated the production deployment story for two years because LLM knowledge depth was the value-add: you could tolerate a 2-second lag for the right answer. End-to-end S2S models like Moshi (and OpenAI's Realtime API class of model) trade depth for naturalness, and have stayed niche in customer-service production because callers notice when the agent doesn't actually know what it's talking about. KAME is the first architecture to ship publicly that breaks that trade-off convincingly, and it does so without retraining either component from scratch: Moshi remains the front-end, and an LLM handles the back-end (the paper specifies Sakana's own TinySwallow-class model). For builders running voice agents, this means the assumption "you pick latency or knowledge" is wrong starting now; the architecture template exists, the weights are public, and the eval set is reproducible.
Concrete moves: if you're running a cascaded voice agent today and lag is a top complaint, KAME's tandem template is worth a prototype week; the latency win is large enough to test on a single customer-service flow. If you're running pure S2S and your agent gets caught not knowing things (the typical Moshi production failure mode), the oracle-stream pattern is portable to other front-ends, not just Sakana's checkpoint. The eval boundary worth flagging: KAME was tested on MT-Bench reasoning, STEM, and humanities, not on coding, extraction, or math; those categories were excluded as speech-incompatible, and you should not assume KAME's audio outputs are well-formed for code dictation or numeric extraction. For domains where structured-output fidelity matters more than naturalness, the cascaded pipeline still wins; a simple router (sketched below) can encode that boundary.
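One cheap way to act on that boundary while prototyping is to route per request by domain, keeping structured-output traffic on the cascaded path. The domain labels here are hypothetical placeholders, not anything KAME ships:

```python
# Hypothetical routing gate encoding the eval boundary above: KAME was
# evaluated on reasoning/STEM/humanities, not coding/extraction/math,
# so structured-output traffic stays on the cascaded path.

STRUCTURED_OUTPUT_DOMAINS = {"code_dictation", "numeric_extraction", "math"}

def pick_pipeline(domain: str) -> str:
    """Route a voice request to the tandem or cascaded stack by domain."""
    if domain in STRUCTURED_OUTPUT_DOMAINS:
        return "cascaded"  # fidelity over latency: untested territory for KAME
    return "tandem"        # latency win where the MT-Bench-style evals apply

assert pick_pipeline("math") == "cascaded"
assert pick_pipeline("customer_service") == "tandem"
```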
