Ontario auditor general Shelley Spence reported this week that all 20 government-approved AI medical scribe vendors produced inaccuracies during procurement testing — "hallucinations (fabrication), incorrect information, or missing or incomplete information." Approximately 5,000 Ontario physicians are deployed on these same systems. The audit doesn't disclose individual vendor names or concrete error examples. Stephen Crawford, Minister of Public and Business Service Delivery and Procurement, defended the rollout: the hallucinations were observed "during regulatory testing, not actually in operational use with doctors."
The Minister's distinction matters and also doesn't. Procurement test prompts are typically designed to stress-test edge cases — synthetic scenarios that probe failure modes — while operational use is mostly routine encounters. So "fails in procurement, works in practice" is a coherent claim. But the audit's actual finding is the 20-for-20 sweep, not the absolute error rate: every vendor approved for clinical deployment shipped a system that could fabricate medical facts under audit conditions. The Futurism writeup doesn't disclose what those conditions were, what fraction of test cases failed per vendor, or how the procurement gate weighted accuracy against other criteria. Without those numbers, the news is the sweep, not the severity.
AI scribes are one of the fastest-deploying clinical AI categories — Nuance DAX, Abridge, Suki, DeepScribe, and a dozen others occupy this market — because the workflow saving is concrete and the model task (transcribe an encounter, structure it into a known-template SOAP note) maps cleanly to LLM strengths. What this audit changes: procurement-grade evaluation is now a public failure mode. Other Canadian provinces, US hospital systems, and ministries of health will run similar audits and likely produce similar findings. Vendors will respond with stricter eval-harness disclosure and red-team data. The OpenEvidence case Futurism also references — US scrutiny over the system overstating conclusions from small studies — suggests the audit pressure will move past scribes into clinical-research-summarization tools next.
Monday: if you're building or selling AI into clinical workflows, expect a public-evaluation regime over the next 12-24 months — governments will publish procurement test results that name specific failure modes, and "but it works in practice" will not stop the disclosure. Have your harness, your eval set, and your red-team artifacts ready to share before the procurement body asks. If you're a physician using AI scribes today, the audit doesn't tell you which system to drop — but it tells you which assumption to drop: that government approval implies the vendor passes accuracy bars in their actual workflow. Add your own QA on top.
