Google published a preprint this week describing Auto-Diagnose, an LLM-based system that reads integration-test logs and tells engineers why a test failed. The motivation is an internal Google survey showing 38.4% of integration test failures take more than an hour to diagnose and 8.9% take more than a day. Unit tests get near-instant triage because the failure surface is a single function; integration tests fan out across services, data centers, and runtime layers, so root-cause analysis becomes a log-archaeology job. Auto-Diagnose automates that archaeology with prompt engineering on a frontier model, not a fine-tuned one.

The model is Gemini 2.5 Flash at temperature 0.1 and top-p 0.8, with no fine-tuning on Google's test corpus. The pipeline collects test driver logs and component logs across data centers, joins them chronologically into a single stream, and submits the whole thing to the model. Average payload per execution: 110,617 input tokens and 5,962 output tokens. Latency runs 56 seconds at p50 and 346 seconds at p90. The prompt walks the model through explicit phases: scan log sections, read context, locate failures, summarize errors, then conclude. The critical engineering choice is a hard anti-hallucination constraint that forces the model to refuse rather than guess when evidence is insufficient. That refusal behavior is what keeps accuracy above 90% in a domain where a confidently wrong diagnosis wastes hours of engineer time.
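The two mechanical pieces described above, a chronological merge of per-service log streams and a phased prompt that ends in a hard refusal rule, can be sketched roughly as follows. All names, the phase wording, and the refusal sentinel are illustrative assumptions; the preprint's actual prompt is not reproduced here.

```python
import heapq
from dataclasses import dataclass

@dataclass(order=True)
class LogLine:
    ts: float          # epoch seconds
    source: str        # e.g. "test-driver", "auth-service" (hypothetical)
    message: str

def join_chronologically(streams: list[list[LogLine]]) -> list[LogLine]:
    """Merge per-service log streams into one time-ordered stream.
    Each input stream is assumed to already be sorted by timestamp."""
    return list(heapq.merge(*streams, key=lambda line: line.ts))

# A phased prompt in the spirit the paper describes: scan, read context,
# locate failures, summarize, conclude -- with refusal as a hard constraint.
PROMPT_TEMPLATE = """\
You will diagnose a failed integration test from its merged logs.
Work through these phases in order:
1. Scan the log sections and note which services appear.
2. Read the surrounding context for each error-level line.
3. Locate the first failure and any cascading failures.
4. Summarize the errors you found, citing log lines verbatim.
5. Conclude with a single root cause.
Hard constraint: if the logs do not contain enough evidence to name
a root cause, reply exactly "INSUFFICIENT EVIDENCE" instead of guessing.

LOGS:
{logs}
"""
```

The merge step is where Google's log centralization does the heavy lifting; in a less centralized setup, getting every stream timestamped on a comparable clock is most of the work.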

Production numbers from May 2025 onward: 224,782 test executions evaluated, 52,635 distinct failing tests diagnosed. Manual evaluation on 71 real-world failures across 39 teams landed at 90.14% root-cause accuracy. Developer feedback: 84.3% of reviewer reactions were "please fix", the helpfulness ratio among developer responses was 62.96%, and the system ranks 14th of 370 internal Critique tools (top 3.78%) by helpfulness. What's notable is what's missing: no fine-tuning, no RAG layer, no custom model. Just Gemini 2.5 Flash with careful prompting and a refusal-on-ambiguity rule. The system also benefits from Google's extreme log centralization: you cannot just ship the same prompt on AWS CloudWatch and get the same numbers, because the prompt assumes logs are already joined chronologically across services.

If you run any kind of multi-service CI, the playbook here is replicable but not cheap. Model cost is negligible (Gemini 2.5 Flash at ~116k tokens per triage is pennies), so the real investment is log plumbing: collecting, normalizing, and joining across services before the LLM sees anything. The refusal-on-ambiguity pattern is the single most transferable idea. Most LLM pipelines in CI are not refusal-tuned and end up hallucinating causes that look plausible, which is worse than silence because it routes engineers toward wrong fixes. If you are wiring up LLM triage for your test suite, copy that pattern first, then worry about the model. The second lesson is that an off-the-shelf frontier model with careful prompting is now competitive with fine-tuned approaches on specialized tasks, as long as you shape the input carefully. That raises the ceiling for what small teams can ship without ML infrastructure.