Gemini 3.5 Flash beats 3.1 Pro on most evals, Antigravity 2.0 as agent-first IDE

Google announced Gemini 3.5 Flash at I/O 2026: outperforms Gemini 3.1 Pro on most benchmarks including coding and agentic tasks, claimed 4x faster than other frontier models with an optimized variant at 12x. Default in the Gemini app and AI Mode in Search globally as of today. Claims include autonomous execution of coding pipelines, managing research projects, building operating systems from scratch, and running independently for multiple hours with pause-for-human-judgment behavior. Antigravity 2.0 launched as a standalone desktop application — "agentic development platform and IDE where agents can live, work, and execute," co-developed with Gemini 3.5 Flash. Gemini Spark debuted as a 24/7 personal AI agent. Search rebuilt with agentic capabilities and generative interfaces. Numbers not disclosed in the announcement: context window, MMLU, SWE-bench Verified, and pricing.

The tier inversion is the move that matters. Historically Flash was Google's cheap sub-tier and Pro was the production frontier. With 3.5, Flash is now the production model and the Pro tier becomes ambiguous — Google didn't say whether 3.1 Pro stays as a deprecated option or what 3.5 Pro will be. The 4x speedup against other frontier models is plausible if benchmarked against OpenAI o-class and Anthropic Opus 4.7 at default settings, but Google didn't publish the harness or comparison details. The 12x-faster optimized variant raises immediate questions about quantization or distillation tradeoffs that weren't addressed. The autonomous-for-hours claims need adversarial benchmarking before they mean anything concrete — yesterday's Anthropic Capability Curve framing (62% to 87% on SWE-bench Verified over twelve months) is the kind of grounded number this announcement is missing. Antigravity 2.0 as a standalone desktop IDE is direct positioning against Claude Code (which just shipped Routines, Managed Agents, and the Capability Curve framing yesterday) and Cursor's background-agent track. "Co-developed with Gemini 3.5 Flash" means Google designed the IDE and model together — same vertical integration play Anthropic runs with Claude Code and Cursor with their model selection.

Three labs lined up on different bets. OpenAI: compute-and-scale Stargate trajectory. Anthropic: AI-assisted research velocity (yesterday's Karpathy hire), Capability Curve framing, infrastructure primitives via MCP and Managed Agents. Google: agent-first development environment plus full-stack integration (Search, Workspace, IDE, mobile). For builders evaluating IDEs, Antigravity 2.0 now joins Claude Code, Cursor, and OpenAI Codex in a four-way fight; the deeper question is whether agent-IDE integration is the right abstraction layer. Anthropic's MCP-as-protocol bet says no, value accrues at the protocol layer. Google's Antigravity bet says yes, the IDE itself becomes the agent operating environment. Google's tier inversion (Flash beats Pro) also signals model-pricing convergence — the price/performance gap between cheap and expensive frontier models is collapsing, which compresses the wrapper-ecosystem margin on model routing as a business strategy.

Monday: if you build on Gemini API, watch Google AI Studio docs for the actual deprecation path on 3.1 Pro and pricing on 3.5 Flash. The user-facing default switch doesn't tell you whether your production API integrations migrate automatically or require change. If you're evaluating IDEs for your team, Antigravity 2.0 belongs in the bake-off alongside Claude Code, Cursor, and Codex. Real test: pick a four-hour autonomous coding task in your repo (refactor a real subsystem, build a service from scratch with tests) and run it on all four with the same starting prompt. Compare wall-clock time to first valid PR, defect rate, and the quality of agent's judgment-pause prompts. The 12x-faster optimized variant should be eval-tested specifically against the standard model — quantization or distillation often degrades multi-step reasoning even when single-shot benchmarks hold. For Gemini Spark and personal agents in general, wait for adversarial reviews on what "manage your digital life 24/7" actually does, costs, and protects. The strongest part of Google's claims is they are testable; adversarial benchmarking over the next four weeks will tell you more than today's announcement does.

Gemini 3.5 Flash beats 3.1 Pro on most evals, Antigravity 2.0 as agent-first IDE

More News