Anthropic released a three-agent harness designed to solve one of autonomous coding's biggest problems: AI agents that lose their way during long development sessions. The system splits work between dedicated planning, generation, and evaluation agents, using structured handoffs and context resets to maintain coherence across coding runs that can last up to four hours and involve 5-15 iterations.
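The loop described above can be sketched as follows. This is a minimal illustration, not Anthropic's actual implementation: the three agent functions are stand-ins for real LLM calls, and the field names in the spec are invented for the example.

```python
import json

def plan(task):
    # Planner agent: emit a structured spec that survives context resets.
    return {"task": task, "requirements": ["render form", "validate input"]}

def generate(spec, feedback=None):
    # Generator agent: works only from the spec plus the last evaluation,
    # so each iteration can start with a fresh context window.
    revision = 0 if feedback is None else feedback["revision"] + 1
    return {"spec": json.dumps(spec), "revision": revision}

def evaluate(attempt, spec):
    # Evaluator agent: judges work it did not produce. Trivial stand-in
    # that accepts the output after two rounds of revisions.
    return {"passed": attempt["revision"] >= 2, "revision": attempt["revision"]}

def run(task, max_iterations=15):
    # Orchestrator: bounded iteration count mirrors the 5-15 runs in the text.
    spec = plan(task)
    feedback = None
    for _ in range(max_iterations):
        attempt = generate(spec, feedback)
        feedback = evaluate(attempt, spec)
        if feedback["passed"]:
            return attempt
    return attempt
```

The key structural point is that `generate` never sees the full transcript, only the spec and the most recent feedback, which is what makes the context resets cheap.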
This addresses what I've been seeing in production AI workflows: agents that start strong but drift into incoherence as context windows fill up. Anthropic's approach of using separate evaluator agents is particularly smart. As Prithvi Rajasekaran from Anthropic Labs notes, "Separating the agent doing the work from the agent judging it proves to be a strong lever" because agents consistently overrate their own output, especially on subjective tasks like UI design. The evaluator uses Playwright to actually navigate and test generated interfaces, providing concrete feedback rather than self-congratulation.
What stands out from the industry response is how this tackles the "amnesia problem" that kills most long-running agents. Artem Bredikhin nailed it on LinkedIn: "every new context window is amnesia." Anthropic's structured handoffs with JSON specs and enforced testing create continuity that compaction techniques can't match. Where compaction preserves context but makes models timid about approaching limits, this system embraces fresh starts with proper state transfer.
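A fresh start with state transfer might look like the sketch below: instead of compacting a long transcript, the outgoing context distills its state into a compact JSON handoff, and the next context window resumes from only that. The field names (`completed`, `open`, `tested`) are illustrative, not from Anthropic's actual spec format.

```python
import json

def write_handoff(history):
    # Distill a long session into the minimal state the next agent needs.
    return json.dumps({
        "completed": [step["name"] for step in history if step["done"]],
        "open": [step["name"] for step in history if not step["done"]],
        "tests_passing": all(step.get("tested", False)
                             for step in history if step["done"]),
    })

def resume_from(handoff_json):
    # The new context window sees only the handoff, never the prior
    # transcript -- amnesia, but with the state that matters carried over.
    state = json.loads(handoff_json)
    return state["open"]

history = [
    {"name": "scaffold app", "done": True, "tested": True},
    {"name": "wire up API", "done": True, "tested": True},
    {"name": "style dashboard", "done": False},
]
handoff = write_handoff(history)
```

Because the handoff is a fixed-size summary rather than an ever-growing transcript, the new context never starts anywhere near its window limit.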
For developers building AI workflows, this validates the pattern we're seeing work: specialized agents with clear boundaries beat general-purpose agents trying to do everything. If you're building coding assistants or design tools, the separate evaluation pattern is worth copying, but make sure your evaluator has real testing capabilities, not just another LLM giving opinions.
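What "real testing capabilities" means in practice: the evaluator runs concrete checks against the artifact and returns pass/fail facts the generator can act on. A minimal sketch, where the check functions and the `rendered_page` dict are hypothetical stand-ins for what a real Playwright-driven run would produce:

```python
def check_has_submit_button(page):
    # Concrete structural check, not a model's opinion.
    return "submit" in page["elements"]

def check_form_validates(page):
    return page.get("validates", False)

def evaluate(page, checks):
    # Return a named pass/fail report the generator can act on directly.
    return {check.__name__: check(page) for check in checks}

# Stand-in for the state a browser-automation pass would extract.
rendered_page = {"elements": ["form", "submit"], "validates": False}
report = evaluate(rendered_page, [check_has_submit_button, check_form_validates])
```

The design choice worth keeping is that each check returns a boolean tied to a named requirement, so a failing iteration produces an actionable diff ("validation missing") rather than a vague quality score.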
