Thinking Machines bets on native real-time models over voice-mode scaffolding, Zubnet AI News

Mira Murati's Thinking Machines emerged this week with their first substantial technical pitch: "interaction models," AI systems that perceive and respond continuously across audio, video, and text rather than waiting for users to finish typing or speaking. The architectural bet is that real-time collaboration belongs inside the model, not stitched on top via voice-activity detection (VAD) plus turn-taking heuristics. Their citation of Sutton's bitter lesson is the unsubtle part: the claim is that bolted-on interactivity will get outpaced by models trained from scratch for it.

The architecture is two-tier, and both tiers are trained from scratch. The interaction model maintains a constant two-way exchange with the user: continuous perception of audio and video streams, a multi-stream micro-turn design, time awareness, and dialog management without a separate VAD or turn-detection component. The background model runs asynchronously and handles sustained reasoning, tool use, search, and longer-horizon work. The interaction model delegates to it when deeper thought is needed, then weaves the results back into the live conversation. The claimed capabilities: the model tracks whether the speaker is thinking, yielding, or self-correcting (with no separate dialog manager); can interject verbally or visually as needed; can speak concurrently with the user (for example, in live translation); has explicit time awareness; and can run simultaneous tool calls, web search, or generative UI while still listening. Thinking Machines claims the interaction model alone is "competitive on both interactive and intelligence benchmarks" but doesn't share specific numbers. They distinguish their approach from contemporary specialized voice models (Moshi, PersonaPlex, Nemotron VoiceChat, GPT-Realtime-Translate) and credit prior work from Qwen-omni, KAME, and MoshiRAG as architectural ancestors.
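The delegation pattern described above can be illustrated with a toy sketch. This is not Thinking Machines' implementation, which has not been published; it only assumes that the interaction tier hands hard requests to an asynchronous background tier and keeps answering in the meantime. All names here (`background_model`, `interaction_loop`, the `hard:` prefix) are hypothetical stand-ins.

```python
import asyncio


async def background_model(query: str, steps: int = 2) -> str:
    """Hypothetical slow tier: each await stands in for a slice of long-running work."""
    for _ in range(steps):
        await asyncio.sleep(0)  # yield control so the live loop keeps running
    return f"result for {query!r}"


async def interaction_loop(chunks: list[str]) -> list[str]:
    """Hypothetical always-on tier: answers every micro-turn and delegates hard ones."""
    transcript: list[str] = []
    pending: set[asyncio.Task] = set()
    for chunk in chunks:
        if chunk.startswith("hard:"):
            # Delegate without awaiting, so the conversation is never blocked.
            query = chunk[len("hard:"):]
            pending.add(asyncio.create_task(background_model(query)))
            transcript.append("ack: let me think about that")
        else:
            transcript.append(f"reply: {chunk}")
        # Weave in whatever the background tier finished in the meantime.
        finished = {task for task in pending if task.done()}
        for task in finished:
            transcript.append(f"weave: {task.result()}")
        pending -= finished
        await asyncio.sleep(0)  # stand-in for waiting on the next input chunk
    for task in pending:  # drain anything still running when input ends
        transcript.append(f"weave: {await task}")
    return transcript


def run_demo() -> list[str]:
    chunks = ["hello", "hard:plan a trip", "uh-huh", "go on", "ok", "thanks"]
    return asyncio.run(interaction_loop(chunks))


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

In this sketch the live loop keeps replying to "uh-huh" and "go on" while the delegated request is still running, and the background result is appended to the transcript only once it is ready. A real system would replace the string prefix with the model's own decision about when to delegate.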

OpenAI's GPT-Realtime, Anthropic's voice mode, and Google's Gemini Live all use a similar shape: foundation model + VAD + text-to-speech + turn management on top. Thinking Machines' bet is that this gets outpaced by native real-time training. The argument has teeth: real-time robotics and autonomous-vehicle stacks already work this way (continuous bidirectional perception, no waiting for "user finished speaking"), and voice-only models like Kyutai's Moshi proved end-to-end audio is feasible at small scale. Thinking Machines is generalizing the pattern across modalities and adding the background-model split for hard reasoning. That is closer to how humans actually collaborate, where you can think slowly about a problem while still nodding and saying "uh-huh" in real time. The catch: voice/video-native training is expensive in both data and compute, and Thinking Machines hasn't shipped scaling numbers. If the architecture works, this is a genuinely different shape for live AI products: agents that converse rather than take turns. If it doesn't, it's an expensive bet against a frontier-lab pipeline that's been working "well enough" for two years.
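For contrast, here is a minimal sketch of the kind of turn-taking scaffolding the paragraph above refers to. It assumes a simple energy-threshold VAD with a fixed silence "hangover"; it is a generic illustration, not any vendor's actual pipeline, and the function name, threshold, and frame values are all hypothetical.

```python
def detect_turn_ends(frame_energies, threshold=0.1, hangover_frames=3):
    """Return the frame indices at which a scaffolded pipeline would decide
    the user has finished speaking and hand the turn to the model."""
    turn_ends = []
    in_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            # Voiced frame: the user is (still) speaking.
            in_speech = True
            silent_run = 0
        elif in_speech:
            # Silent frame after speech: count toward the end-of-turn timeout.
            silent_run += 1
            if silent_run >= hangover_frames:
                turn_ends.append(i)
                in_speech = False
                silent_run = 0
    return turn_ends


if __name__ == "__main__":
    # Speech, a one-frame pause, more speech, then a long pause; then a short utterance.
    energies = [0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0]
    print(detect_turn_ends(energies))
```

The heuristic sees only silence length: a one-frame pause is ignored, but any pause of three frames ends the turn, whether the speaker was yielding or merely thinking. That distinction is exactly what the interaction model is claimed to learn natively.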

Research preview only, and not available to try yet: a limited research preview is promised "in coming months," with a wider release "later this year." Thinking Machines was founded in February 2025 by Murati after she left her post as OpenAI's CTO; the lab has since lost staff to Meta and back to OpenAI, which sets a higher bar for proving "they actually ship" than an established lab would face. The technical pitch is real and worth tracking. The bitter-lesson framing also acts as a public commitment device: they've now tied their architectural identity to "no scaffolding," which makes it harder to quietly fall back to voice-mode-plus-pipeline if from-scratch training doesn't scale. Demo examples shown: tracking mentions of animals in a story, real-time speech translation, and posture correction (telling someone when they're slouching). That is concrete enough to count as a research artifact, but it is not a product yet.
