Google launched Gemini 3.1 Flash Live today, a real-time conversational audio model that's rolling out in Search, Gemini apps, and developer APIs. Google claims significant gains on audio benchmarks: the model tops ComplexFuncBench Audio for multi-step tasks and Big Bench Audio's 1,000-question reasoning test. However, it manages only 36.1% on Scale AI's MultiChallenge, which tests handling of hesitations and interruptions, while non-conversational audio models can hit 50%.
What's notable isn't just the performance gains, but Google's decision to embed SynthID watermarks in all outputs, inaudible to humans but detectable by software. This suggests Google genuinely believes Flash Live sounds human enough to fool people, which would mark a meaningful leap from the stilted cadence that typically gives away AI speech. Companies like Home Depot and Verizon are already testing it for customer service applications.
This continues the pattern I noted in March, when Google first claimed 90% performance on complex audio tasks but faced little real competition. Now we have actual deployment and benchmark numbers, though Google still won't specify latency figures beyond claiming it has "the speed you need," presumably under the 300ms threshold researchers consider optimal for natural conversation.
For developers, Flash Live is available through AI Studio, the Gemini API, and Gemini Enterprise for Customer Experience. The watermarking requirement signals this isn't just another incremental improvement: Google expects the model to be convincing enough that distinguishing human speech from AI speech will become a real problem. Whether that's justified remains to be seen, but the 36.1% interruption-handling score suggests we're not quite at human-level conversation yet.
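For the API route, a session against the Live API in the google-genai Python SDK looks roughly like the sketch below. It follows the SDK's documented bidirectional streaming pattern; the model identifier is my assumption (Google hasn't published the exact string here), and the text-in/audio-out exchange stands in for the microphone streaming you'd use in a real voice app.

```python
# Minimal sketch of a Live API session with the google-genai Python SDK.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

MODEL = "gemini-3.1-flash-live"  # hypothetical identifier; use the name AI Studio lists
CONFIG = {"response_modalities": ["AUDIO"]}  # ask for spoken output

async def main():
    # Open a bidirectional streaming session with the model.
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        # Send one user turn as text; a real-time voice app would stream
        # microphone audio via send_realtime_input instead.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What's your return policy?"}]},
            turn_complete=True,
        )
        # Collect the model's spoken reply as raw audio chunks.
        audio = bytearray()
        async for message in session.receive():
            if message.data is not None:
                audio.extend(message.data)
        print(f"Received {len(audio)} bytes of audio")

asyncio.run(main())
```

Note that the watermark isn't something you toggle: per the launch details above, SynthID is applied to the generated audio on Google's side, so the bytes collected here would already carry it.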
