Microsoft Webwright: 1000 lines, Playwright as actions, Odysseys 33.5 to 60.1%

Microsoft Research released Webwright this week: a web agent framework that throws out browser DOM clicking and screenshot-coordinate prediction in favour of having the agent write Playwright code inside a terminal. The architectural bet: treat the browser as a launchable tool, not a stateful session. The agent receives context and returns code plus reasoning; the code runs in a Terminal Environment, and the resulting observations (logs, screenshots, return values) are fed back into context. Three components, about 1000 lines total: Runner at ~150 LOC orchestrates the loop, Model Endpoint at ~550 LOC handles the LLM interface, Terminal Environment at ~300 LOC executes everything. Single agent loop, no multi-agent orchestration. For builders who've watched the browser-agent stack accumulate Operator-style DOM wrappers and screenshot pipelines, this is the architectural minimalism move.
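To make the loop shape concrete, here is a minimal sketch of a single-agent loop with the same three roles. It is not Webwright's code: `Step`, `TerminalEnvironment.execute`, `run_task` and `scripted_model` are hypothetical names, the "model" is a scripted stand-in rather than an LLM call, and the terminal simply runs Python in a subprocess.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    reasoning: str
    code: Optional[str]  # None means the model declares the task finished


@dataclass
class TerminalEnvironment:
    """Runs model-written code in a fresh subprocess and captures its output."""
    timeout_s: float = 30.0

    def execute(self, code: str) -> str:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=self.timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "exit=timeout"
        return (f"exit={proc.returncode}\n"
                f"stdout:\n{proc.stdout}\nstderr:\n{proc.stderr}")


def run_task(task: str,
             model: Callable[[List[str]], Step],
             env: TerminalEnvironment,
             max_steps: int = 10) -> List[str]:
    """Runner: ask the model for code, execute it, feed the observation back."""
    context = [f"TASK: {task}"]
    for _ in range(max_steps):
        step = model(context)
        context.append(f"REASONING: {step.reasoning}")
        if step.code is None:
            break
        context.append(f"OBSERVATION: {env.execute(step.code)}")
    return context


def scripted_model(context: List[str]) -> Step:
    """Stand-in for the model endpoint: acts once, then stops."""
    if not any(line.startswith("OBSERVATION") for line in context):
        return Step("Run a trivial script first.",
                    "print('page title: Example Domain')")
    return Step("Output looks right; finishing.", None)
```

Calling `run_task("demo", scripted_model, TerminalEnvironment())` returns the accumulated context. In a real harness the scripted model would be replaced by an LLM call and the executed code would drive a browser.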

Benchmarks: Odysseys (long-horizon multi-site browsing, tasks averaging 272.3 words) — GPT-5.4 base 33.5%, Webwright on GPT-5.4 lifts it to **60.1%** (79.4% relative improvement). Prior state-of-the-art on Odysseys was Opus 4.6 at 44.5%, set in April 2026. Claude Opus 4.7 with Webwright completes tasks in fewer steps (mean 21.9 vs 26.3) but at $6.09 per task versus GPT-5.4's $2.37 — the cost/step tradeoff is real and explicit. Online-Mind2Web (300 tasks, 136 sites): Webwright+GPT-5.4 hits 86.67% accuracy. Qwen3.5-9B with pre-built tool scripts: 66.2% on the hard split. Engineering caveats Microsoft documents honestly: models prematurely declare "done" without finishing, mitigated by self-reflection plus fresh-folder validation plus explicit success/failure judgment; context explosion on long trajectories, mitigated by compacting history every 20 steps.
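The article only says history is compacted every 20 steps, not how. One plausible reading, sketched below under that assumption, is to replace everything except the most recent entries with a single summary entry. `compact_history`, `maybe_compact`, `keep_recent` and `naive_summarize` are illustrative names; a real system would presumably ask the model to write the summary.

```python
from typing import Callable, List


def compact_history(history: List[str],
                    summarize: Callable[[List[str]], str],
                    keep_recent: int = 5) -> List[str]:
    """Replace all but the last `keep_recent` entries with one summary entry."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"SUMMARY: {summarize(older)}"] + recent


def maybe_compact(history: List[str],
                  step: int,
                  summarize: Callable[[List[str]], str],
                  every: int = 20,
                  keep_recent: int = 5) -> List[str]:
    """Compact only on every `every`-th step; otherwise return history as is."""
    if step > 0 and step % every == 0:
        return compact_history(history, summarize, keep_recent)
    return history


def naive_summarize(entries: List[str]) -> str:
    """Placeholder summarizer: counts entries and keeps a snippet of the last."""
    return f"{len(entries)} earlier entries, last: {entries[-1][:60]}"


def simulate(total_steps: int) -> List[str]:
    """Append one entry per step and compact on the 20-step schedule."""
    history: List[str] = []
    for step in range(1, total_steps + 1):
        history.append(f"step {step}: observation")
        history = maybe_compact(history, step, naive_summarize)
    return history
```

With these defaults the context never grows past about 25 entries between compactions, at the price of losing detail from older steps.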

Ecosystem read: this is the second major browser-agent release in a fortnight after Microsoft's own Fara1.5 family of 4B/9B/27B browser models. Fara was the model side; Webwright is the harness. The two represent a coherent stance — keep the model surface minimal and let Playwright code (Microsoft's own browser-automation library, originally for testing) carry the action vocabulary. That's a different bet from OpenAI's Operator (DOM-tree perception, click coordinates) and Google's Antigravity 2.0 (browser-as-runtime). For builders, the implication is concrete: if you've been writing custom DOM-scraping harnesses or wrestling with screenshot-to-coordinate prediction, the Playwright-code-as-action-language path now has a published baseline that beats the prior SOTA by 15.6 absolute points on Odysseys. Repository: github.com/microsoft/Webwright. Ships with a Claude Code skill — no separate LLM key beyond a Claude subscription, with project-scoped or user-scoped install paths.
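For readers who have not used Playwright, this is roughly what a code-as-action step could look like with the Python sync API. It is a sketch, not output from Webwright: `open_and_observe` and `main` are hypothetical names, and the URL and screenshot path are placeholders. The Playwright calls used (`page.goto`, `page.screenshot`, `page.title`, `page.locator`, `sync_playwright`) are part of the library's public API.

```python
from typing import Any, Dict


def open_and_observe(page: Any, url: str, screenshot_path: str) -> Dict[str, str]:
    """One action: navigate, capture a screenshot, and return what was seen.

    `page` is expected to be a Playwright sync Page. It is typed as Any so
    this module still imports where Playwright is not installed.
    """
    page.goto(url)
    page.screenshot(path=screenshot_path)
    observation = {
        "url": page.url,
        "title": page.title(),
        "heading": page.locator("h1").first.inner_text(),
    }
    print(observation)  # logs are part of what flows back into context
    return observation


def main() -> None:
    # Requires `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # launched per run, not a long-lived session
        page = browser.new_page()
        open_and_observe(page, "https://example.com", "example.png")
        browser.close()
```

Running `main()` launches headless Chromium, visits the page and prints the observation. The action vocabulary here is just ordinary Playwright calls, with no custom DOM wrapper or coordinate predictor in between.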

Monday morning: if you're shipping a web-agent product, clone the repo and run the Odysseys split against your current harness. The apples-to-apples comparison is what tells you whether your DOM-walker is doing real work or whether a Playwright-code generator on the same base model would do better. The 1000-LOC budget makes that test cheap to set up. If you're prototyping web agents from scratch, the Webwright shape (Runner / Model Endpoint / Terminal Env) is a reasonable starting decomposition: small enough to read in an evening, structured enough to extend. The cost/step tradeoff with Opus 4.7 is also worth modelling explicitly in your budget: paying $6.09 per task with Opus 4.7 instead of $2.37 with GPT-5.4 may or may not be worth the 4.4-step reduction, depending on what your agent is actually paid to do.
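A back-of-envelope way to model that tradeoff, using only the per-task cost and mean-step figures quoted above. It assumes the 26.3-step mean belongs to GPT-5.4 with Webwright, which is how the comparison reads. `Profile`, `extra_cost_per_saved_step` and `monthly_cost` are illustrative names, and the task volume is yours to fill in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    cost_per_task_usd: float
    mean_steps: float


# Figures as quoted in the article; pairing 26.3 steps with GPT-5.4 is an assumption.
GPT_5_4 = Profile("GPT-5.4 + Webwright", 2.37, 26.3)
OPUS_4_7 = Profile("Opus 4.7 + Webwright", 6.09, 21.9)


def extra_cost_per_saved_step(cheap: Profile, fast: Profile) -> float:
    """Dollars paid per task for each step the faster profile saves."""
    saved_steps = cheap.mean_steps - fast.mean_steps
    if saved_steps <= 0:
        raise ValueError("the 'fast' profile does not save any steps")
    return (fast.cost_per_task_usd - cheap.cost_per_task_usd) / saved_steps


def monthly_cost(profile: Profile, tasks_per_month: int) -> float:
    """Total spend for a given task volume at the quoted per-task cost."""
    return profile.cost_per_task_usd * tasks_per_month


if __name__ == "__main__":
    premium = extra_cost_per_saved_step(GPT_5_4, OPUS_4_7)
    print(f"Extra cost per saved step: ${premium:.2f}")
    for profile in (GPT_5_4, OPUS_4_7):
        print(f"{profile.name}: ${monthly_cost(profile, 1000):,.2f} per 1000 tasks")
```

If a saved step is worth less to you than the premium this prints (latency, rate limits, user patience), the cheaper profile wins on these numbers. Success rate is not part of this comparison and should be weighed separately.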
