Microsoft Webwright: 1000 lines, Playwright as the action language, Odysseys from 33.5% to 60.1%

Microsoft Research released Webwright this week: a web agent framework that drops DOM clicking and screenshot-coordinate prediction and instead has the agent write Playwright code inside a terminal. The architectural bet is to treat the browser as a launchable tool, not a stateful session. The agent receives context, returns code and reasoning, executes the code through the Terminal Environment, and folds the observations (logs, screenshots, return values) back into its context. There are three components totaling roughly 1000 lines: the Runner (~150 LOC) orchestrates the loop, the Model Endpoint (~550 LOC) handles the LLM interface, and the Terminal Environment (~300 LOC) executes everything. It is a single agent loop with no multi-agent orchestration. For builders who have watched Operator-style DOM wrappers and screenshot pipelines pile up in their browser-agent stacks, this is a move toward architectural minimalism.
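The loop described above can be sketched in a few dozen lines. This is an illustrative sketch only, not Webwright's actual code: the class names `TerminalEnvironment` and `StubModelEndpoint`, the `run` function, and the `(reasoning, code, done)` return shape are all assumptions made for the example. The stub model emits a trivial script so the sketch runs without a browser or an LLM key; in a real setup the generated script would presumably be Playwright code.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Observation:
    stdout: str
    stderr: str
    returncode: int


class TerminalEnvironment:
    """Hypothetical executor: runs each generated script in a fresh process."""

    def __init__(self, workdir: str, timeout_s: float = 60.0):
        self.workdir = Path(workdir)
        self.timeout_s = timeout_s

    def execute(self, code: str, step: int) -> Observation:
        script = self.workdir / f"step_{step}.py"
        script.write_text(code, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True,
            timeout=self.timeout_s, cwd=self.workdir,
        )
        return Observation(proc.stdout, proc.stderr, proc.returncode)


class StubModelEndpoint:
    """Stand-in for the LLM call; returns (reasoning, code, done)."""

    def complete(self, context: list[str]) -> tuple[str, str, bool]:
        if len(context) == 1:  # only the task is in context so far
            return ("Compute the answer in a script.", "print(6 * 7)", False)
        return ("The last observation contains the answer.", "", True)


def run(task: str, model, env: TerminalEnvironment, max_steps: int = 10) -> list[str]:
    """Single agent loop: ask the model, execute its code, feed back the logs."""
    context = [f"TASK: {task}"]
    for step in range(max_steps):
        reasoning, code, done = model.complete(context)
        if done:
            context.append(f"DONE: {reasoning}")
            break
        obs = env.execute(code, step)
        context.append(
            f"STEP {step}: {reasoning}\n"
            f"stdout={obs.stdout!r} stderr={obs.stderr!r} rc={obs.returncode}"
        )
    return context


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for line in run("multiply 6 by 7", StubModelEndpoint(), TerminalEnvironment(tmp)):
            print(line)
```

Running each step as a separate process is one way to read "launchable tool, not a stateful session": nothing persists between steps except what the script writes to disk and what the loop copies back into context.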

Benchmarks: on Odysseys (long-horizon multi-site browsing; tasks average 272.3 words), base GPT-5.4 scores 33.5% and Webwright on GPT-5.4 lifts that to **60.1%**, a 79.4% relative improvement. The previous Odysseys SOTA was Opus 4.6 at 44.5%, set in April 2026. Claude Opus 4.7 with Webwright completes tasks in fewer steps (mean 21.9 vs 26.3) but costs $6.09 per task against $2.37 for GPT-5.4, so the cost/step tradeoff is real and explicit. On Online-Mind2Web (300 tasks, 136 sites), Webwright+GPT-5.4 reaches 86.67% accuracy. Qwen3.5-9B with pre-built tool scripts reaches 66.2% on the hard split. Microsoft documents the engineering caveats candidly. First, models prematurely declare "done" without finishing; this is mitigated by self-reflection, fresh-folder validation, and an explicit success/failure judgment. Second, context explodes on long trajectories; this is mitigated by compacting history every 20 steps.
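As a rough illustration of the second mitigation, here is one way periodic history compaction could look. Only the 20-step interval comes from the article; the function names, the summary format, and the choice to collapse the entire history into a single entry are assumptions, and a real summarizer would likely call the model instead of counting entries.

```python
from typing import Callable

COMPACT_EVERY = 20  # interval reported in the article


def summarize(entries: list[str]) -> str:
    """Hypothetical summarizer; a real one would likely ask the model."""
    return f"[summary of {len(entries)} earlier entries]"


def append_step(
    history: list[str],
    entry: str,
    steps_taken: int,
    summarizer: Callable[[list[str]], str] = summarize,
) -> list[str]:
    """Append one step; on every COMPACT_EVERY-th step, collapse history to a summary."""
    history = history + [entry]
    if steps_taken % COMPACT_EVERY == 0:
        return [summarizer(history)]
    return history


# Simulate a 45-step trajectory and show how much history survives.
history: list[str] = []
for step in range(1, 46):
    history = append_step(history, f"step {step} log", step)
print(len(history), history[0])
```

With this scheme the context never holds more than one summary plus the entries since the last compaction, at the price of losing step-level detail from earlier in the trajectory.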

Ecosystem read: this is Microsoft's second major browser-agent release in fifteen days, after its own Fara1.5 family (4B/9B/27B models). Fara was the model side; Webwright is the harness. Together they represent a coherent stance: keep the model surface minimal and let Playwright code carry the action vocabulary (Playwright is Microsoft's own browser-automation library, originally built for testing). That is a different bet from OpenAI's Operator (DOM-tree perception, click coordinates) and Google's Antigravity 2.0 (browser-as-runtime). The implication for builders is concrete: if you have written custom DOM-scraping harnesses or are wrestling with screenshot-to-coordinate prediction, the Playwright-code-as-action-language path now has a published baseline that beats the previous Odysseys SOTA by 15.6 absolute points (60.1% vs 44.5%). Repository: github.com/microsoft/Webwright. It ships with a Claude Code skill, with project-scoped or user-scoped install paths, and needs no separate LLM key beyond a Claude subscription.

Monday morning: if you are shipping a web-agent product, clone the repo and run the Odysseys split against your current harness. An apples-to-apples comparison tells you whether your DOM-walker is doing real work or whether a Playwright code generator on the same base model would do better. The 1000-LOC budget makes the test cheap to set up. If you are prototyping web agents from scratch, the Webwright shape (Runner / Model Endpoint / Terminal Env) is a reasonable starting decomposition: small enough to read in an evening, structured enough to extend. The cost/step tradeoff with Opus 4.7 is also worth modeling explicitly in your budget: at $6.09 per task with Opus against $2.37 with GPT-5.4, whether the 4.4-step reduction is worth it depends on what your agent is actually paid for.
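To make that budget question concrete, the reported figures can be turned into a few derived numbers. This sketch assumes the 26.3-step mean belongs to the GPT-5.4 run, which the article implies but does not state outright; the variable names are illustrative.

```python
# Figures as reported in the article.
BASE, WEBWRIGHT, PRIOR_SOTA = 33.5, 60.1, 44.5  # Odysseys success rate, %
GPT = {"cost": 2.37, "steps": 26.3}   # assumed: 26.3 is the GPT-5.4 mean
OPUS = {"cost": 6.09, "steps": 21.9}

relative_gain = (WEBWRIGHT - BASE) / BASE * 100   # % improvement over base
margin = WEBWRIGHT - PRIOR_SOTA                   # absolute points over prior SOTA
steps_saved = GPT["steps"] - OPUS["steps"]        # mean steps saved by Opus
extra_cost = OPUS["cost"] - GPT["cost"]           # extra dollars per task

print(f"relative gain over base:     {relative_gain:.1f}%")
print(f"margin over prior SOTA:      {margin:.1f} points")
print(f"cost ratio (Opus / GPT):     {OPUS['cost'] / GPT['cost']:.2f}x")
print(f"cost per step, GPT-5.4:      ${GPT['cost'] / GPT['steps']:.3f}")
print(f"cost per step, Opus 4.7:     ${OPUS['cost'] / OPUS['steps']:.3f}")
print(f"extra dollars per step saved: ${extra_cost / steps_saved:.2f}")
```

If the agent is billed per completed task, the extra dollars per task are what matter; if latency or step count is the constraint, the dollars-per-step-saved figure is the one to compare against your own margins.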
