A Microsoft Research preprint just dropped a benchmark called DELEGATE-52 and the headline number is sharp: GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro corrupt an average of 25% of document content by the end of long delegated workflows. The paper, "LLMs Corrupt Your Documents When You Delegate" by Philippe Laban, Tobias Schnabel, and Jennifer Neville, evaluates 19 LLMs total across 52 professional domains that include coding, crystallography, and music notation. It is a preprint, not yet peer reviewed, and the benchmark deliberately tests extended delegated workflows rather than single-turn requests. That framing matters: this is not "can a model edit one document" but "what happens when you hand a model a multi-step editing job and walk away."

The three named degradation factors are concrete and testable: document size, interaction length, and the presence of distractor files in the working context. All three make corruption worse in the reported results. That maps directly to the operational shape of agentic workflows in the wild (long contexts, many turns, many adjacent files the agent can see) and is consistent with the class of failures that teams running long Claude Code or Codex sessions on real codebases have been reporting anecdotally. The benchmark gives that anecdote a number, attached to specific frontier model versions, with a published harness rather than vendor self-report. The companion Futurism coverage notes that Microsoft's own Copilot was excluded from the frontier-model evaluation. Read that how you want, but the absence is worth flagging.

The honest caveats: the 25% is an average across 52 domains, and averages hide variance. Without the per-domain breakdown, you cannot tell whether coding documents corrupt at 5% and crystallography at 60%, or whether the result is uniform. The abstract does not pin down the operational definition of "content corruption": whether that means factual errors, syntactic breakage, lost sections, hallucinated additions, or some weighted composite. Preprint status means the methodology will get pulled apart in review, and the harness specifics matter for any team trying to reproduce. None of this invalidates the headline; it just means the headline is the start of the read, not the end of it.
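
To make the variance point concrete, here is a small sketch with made-up numbers. The domain names and rates below are purely illustrative assumptions, not figures from the paper; they only show that two very different per-domain pictures can share the same 25% average.

```python
from statistics import mean, pstdev

# Hypothetical per-domain corruption rates, in percent.
# These values are invented for illustration and are NOT from the paper.
uniform = {"domain_a": 25, "domain_b": 25, "domain_c": 25, "domain_d": 25}
skewed = {"domain_a": 5, "domain_b": 5, "domain_c": 30, "domain_d": 60}

for name, rates in (("uniform", uniform), ("skewed", skewed)):
    values = list(rates.values())
    # Same mean in both cases; the spread and the worst case differ.
    print(
        name,
        "mean:", mean(values),
        "spread:", round(pstdev(values), 1),
        "worst:", max(values),
    )
```

Both sets average to the same headline figure, but only the spread and the worst case tell you whether a given domain is safe to delegate.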

For builders shipping delegated workflows: the practical implication is that "send the model a document and ask it to edit through ten steps" is not yet a safe abstraction at frontier-model quality. Either keep the human in every commit, shorten the delegation horizon, or use document-diff verification at each step before propagating changes. Watch arxiv.org/abs/2604.15597 for the full PDF and the per-domain numbers when the paper updates; those numbers will tell you which specific kinds of document work are still safe to delegate and which are the 25%-corruption end of the distribution.
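
One way to picture document-diff verification at each step is the minimal sketch below. It is not the paper's harness: the function names, the `max_change` threshold of 0.2, and the two stand-in edit steps are all assumptions made for illustration, and a line-level diff only catches edits that change too much, not subtle factual corruption.

```python
import difflib
from typing import Callable, List, Optional, Tuple


def changed_fraction(before: str, after: str) -> float:
    """Return 1 minus the line-level similarity of two texts (0.0 = identical)."""
    matcher = difflib.SequenceMatcher(
        None, before.splitlines(), after.splitlines(), autojunk=False
    )
    return 1.0 - matcher.ratio()


def run_with_verification(
    document: str,
    steps: List[Callable[[str], str]],
    max_change: float = 0.2,  # hypothetical threshold; tune per document type
) -> Tuple[str, Optional[int]]:
    """Apply edit steps in order, stopping at the first one that changes too much.

    Returns the last accepted document and the index of the rejected step
    (None if every step passed).
    """
    accepted = document
    for index, step in enumerate(steps):
        candidate = step(accepted)
        if changed_fraction(accepted, candidate) > max_change:
            return accepted, index  # do not propagate the suspicious edit
        accepted = candidate
    return accepted, None


# Stand-ins for model calls (hypothetical; a real step would call an LLM).
def small_edit(text: str) -> str:
    return text.replace("line 3", "line three")


def destructive_edit(text: str) -> str:
    return text.splitlines()[0]  # drops everything but the first line


original = "\n".join(f"line {n}" for n in range(1, 11))
final, rejected_at = run_with_verification(original, [small_edit, destructive_edit])
print("rejected step:", rejected_at)
print(final)
```

The design choice here is to halt and keep the last accepted version rather than try to repair a bad step, which keeps a human decision at the point where the diff looks wrong.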