Andon Labs, the AI-safety startup behind Anthropic's Project Vend last year, put Gemini in charge of a Vienna café called "Café-Faire" for a month. The agent, named Mona, set up electricity and internet, placed LinkedIn job ads, secured outdoor seating permits, and opened wholesaler accounts. It also ordered 3,000 rubber gloves for a handful of employees, 6,000 napkins, 4 first-aid kits, and canned tomatoes that weren't on the menu. Final tally: $5,700 in sales against more than $16,000 in spending from a $21,000 budget, a net loss of about $10,300. The diagnosed cause: context-window limits that made Mona forget past orders.
This is the second Andon Labs long-horizon agent eval to land publicly. The AP describes the first, Anthropic's Project Vend (Claude managing a vending machine), as "even more disastrous," with abusive behavior toward customers and wasteful spending. Café-Faire makes the failure mode legible: agents can handle one-off setup tasks like utilities, hiring ads, permits, and supplier accounts because each is a self-contained sequence of API calls. They cannot reliably handle inventory management, because that requires remembering prior purchases over weeks and the context window doesn't extend that far. Mona double-ordered because it had no persistent ledger of what it had already bought. Andon Labs didn't disclose which Gemini version was used, but the article frames this as the current frontier-class model, meaning the context-window memory bottleneck is the constraint at frontier scale, not a small-model artifact. The specific failures (3,000 gloves, 6,000 napkins, off-menu canned tomatoes) look absurd in isolation, but they're structurally inevitable when an agent has no durable state.
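The forgetting mechanism described above can be illustrated with a toy simulation. This is only a sketch under made-up assumptions, not a model of how Mona or Gemini actually works: the `WINDOW` size, the `REORDER_QTY`, and the `run` function are all hypothetical, and a fixed-length `deque` stands in for a bounded context window. The agent reorders whenever it can see no record of an earlier purchase.

```python
from collections import deque

REORDER_QTY = 500  # hypothetical units bought per order
WINDOW = 3         # hypothetical number of recent events the agent can "see"


def run(days, use_ledger):
    """Simulate an agent that orders gloves whenever it sees no prior order."""
    context = deque(maxlen=WINDOW)  # stand-in for a bounded context window
    ledger = []                     # durable record kept outside the context
    total_ordered = 0
    for day in range(days):
        # The agent decides based only on what it can currently read.
        visible = ledger if use_ledger else context
        if "ordered gloves" in visible:
            event = f"day {day}: routine ops"
        else:
            event = "ordered gloves"
            total_ordered += REORDER_QTY
            ledger.append(event)
        # Old events fall out of the deque once it is full.
        context.append(event)
    return total_ordered


print(run(12, use_ledger=False))  # reorders each time the old order scrolls away
print(run(12, use_ledger=True))   # orders once, then the ledger blocks repeats
```

With these toy settings the context-only agent orders three times in twelve days, while the ledger-backed agent orders once. The quantities are illustrative and are not meant to reproduce the Café-Faire figures.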
Long-horizon agent management is exactly the workload Anthropic shipped to public beta last week with Multiagent Orchestration + Outcomes (the grader-in-its-own-context architecture), and the same problem space Signadot's `/signadot-validate` skill targets for Kubernetes deploys (per-agent sandboxes with routing-key isolation). The pattern across all of these: frontier-lab agent products are mostly bottlenecked on memory and state, not on raw model capability. Andon Labs' value as an eval team is naming this constraint with specific dollar losses across multiple labs: first Anthropic with Project Vend, now Google's Gemini with Café-Faire. Expect similar results when someone runs the same kind of eval against GPT-5.5, Llama, or DeepSeek. The diagnosis is consistent with what Anthropic's own "Dreaming" memory-curation feature (announced at Code with Claude 2026) is trying to solve. The cycle going forward is predictable: Andon Labs runs an eval and finds a context-window failure, frontier labs ship a memory, dreaming, or agent-state product, the eval gets rerun, repeat. The interesting open question is whether persistent agent memory can be solved with retrieval plus structured logs, or whether it requires architectural changes such as state tokens, neural memory modules, or true long-context windows that don't degrade.
Andon Labs is establishing itself as the agent-eval equivalent of what METR has become for autonomous-research evals: running long-horizon real-world tests at frontier-lab scale and publishing legible failure modes with dollar figures attached. For anyone shipping an agent product in production right now: budget for an Andon-Labs-style failure (your agent will forget past actions and repeat them) and build durable state outside the agent's context window, such as a structured ledger, a memory store, or a database the agent has to read from before deciding. For the broad audience: "AI is going to run businesses" is the marketing pitch; "AI orders 6,000 napkins because it forgot what it had already bought" is the substance. Café-Faire is more useful as a benchmark than as a story. The $10,300 loss number is going to get cited a lot.
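One way the "read from the ledger before deciding" advice might look in practice is a small purchase ledger that the ordering tool consults on every call. The sketch below uses Python's standard `sqlite3` module. The table layout, the function names (`open_ledger`, `total_ordered`, `guarded_order`), and the `max_total` cap are all assumptions made for illustration and are not taken from Andon Labs' setup. It also tracks cumulative purchases only; a real system would presumably subtract what has been used.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Create (or open) a purchase ledger that lives outside the agent's context."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases ("
        "item TEXT NOT NULL, qty INTEGER NOT NULL, "
        "ordered_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def total_ordered(conn, item):
    """Return the total quantity of `item` already recorded in the ledger."""
    row = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM purchases WHERE item = ?", (item,)
    ).fetchone()
    return row[0]


def guarded_order(conn, item, qty, max_total):
    """Record an order only if the running total stays within max_total.

    Returns the quantity actually ordered (0 if the request is refused).
    """
    already = total_ordered(conn, item)  # read durable state before deciding
    if already + qty > max_total:
        return 0
    conn.execute("INSERT INTO purchases (item, qty) VALUES (?, ?)", (item, qty))
    conn.commit()
    return qty


ledger = open_ledger()
print(guarded_order(ledger, "napkins", 2000, max_total=2500))  # recorded
print(guarded_order(ledger, "napkins", 2000, max_total=2500))  # refused, cap exceeded
```

The design choice worth noting is that the check lives in the tool, not in the prompt: the agent cannot place a repeat order by forgetting, because the guard reads the database regardless of what is still in the context window.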
