A new Harvard / Beth Israel study published in Science measured OpenAI's o1 against attending physicians on real emergency room diagnoses, and o1 came out ahead. Seventy-six patients, two attendings as the comparison group, two more as a blinded panel for ground truth. o1 hit 67% diagnostic accuracy at triage; the comparison physicians scored 55% and 50%. The model received the same EMR data the doctors had at the time of diagnosis: not preprocessed, not curated, not fed expert-summarized vignettes.
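One number worth interrogating before anything else: what 76 patients buys you in statistical precision. Here's a minimal sketch, assuming the simplest possible scoring (each case marked correct or incorrect, which flattens whatever graded rubric the study actually applied), that puts Wilson 95% intervals around the reported accuracies:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

N = 76  # patients in the study
# Reported top-line accuracies, converted to whole-case counts for illustration.
for label, acc in [("o1", 0.67), ("attending 1", 0.55), ("attending 2", 0.50)]:
    k = round(acc * N)
    lo, hi = wilson_ci(k, N)
    print(f"{label}: {k}/{N} correct, 95% CI ({lo:.0%}, {hi:.0%})")
```

The intervals come out wide and overlapping at this sample size, which doesn't undo the result but does argue for reading the 12-point gap as a strong signal rather than a settled effect size.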

The setup is the part that matters. Most "AI beats doctors" headlines run on synthetic case vignettes, board-exam problems, or curated published cases that editors have already cleaned up. This study used real Beth Israel ER intake records with the same information available at the moment a doctor was looking at a patient. Ground truth came from a blinded panel: two evaluating attendings who didn't know which diagnoses came from human physicians and which came from o1. The lead authors are Arjun Manrai (Harvard Medical School) and Adam Rodman (Beth Israel Deaconess). The model tested was o1, not o3 or GPT-5 or Claude Sonnet 4.5, so the result is, if anything, a conservative baseline for current frontier reasoning models. The researchers were explicit about limitations: the model was tested on text only, remains "more limited in reasoning over nontext inputs," and is not ready for life-or-death decisions without prospective trials.
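If you want to replicate the methodology, the blinding step is the easy one to get wrong. Here's a sketch of roughly what it involves, with hypothetical field names (the study's actual protocol isn't published as code): shuffle human and model diagnoses together, strip provenance before the panel sees anything, and join source labels back only after scoring.

```python
import random

def build_blinded_queue(cases):
    """cases: [{'case_id': ..., 'human_dx': ..., 'model_dx': ...}, ...]
    Returns (answer_key, blinded_view); the panel sees only blinded_view."""
    items = []
    for case in cases:
        for source, dx in (("human", case["human_dx"]),
                           ("model", case["model_dx"])):
            items.append({"case_id": case["case_id"],
                          "diagnosis": dx,
                          "source": source})
    random.shuffle(items)
    # Opaque item numbers assigned after shuffling, so neither the ordering
    # nor the identifier leaks the source to the panel.
    answer_key = [{"item_id": i, **item} for i, item in enumerate(items)]
    blinded_view = [{"item_id": i,
                     "case_id": item["case_id"],
                     "diagnosis": item["diagnosis"]}
                    for i, item in enumerate(items)]
    return answer_key, blinded_view

cases = [{"case_id": "c01", "human_dx": "PE", "model_dx": "pneumonia"}]
answer_key, blinded_view = build_blinded_queue(cases)
```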

The ecosystem signal isn't "AI is better than doctors." It's that we now have a credible enough eval methodology that the question stops being "can the model do this on benchmarks" and starts being "what's the deployment path." Rodman flagged the actual gap: there's "no formal framework right now for accountability" when an AI-suggested diagnosis is wrong. That's the load-bearing missing piece. The model is good enough to be useful as a second opinion. The infrastructure is the gap: who's responsible when the second opinion is wrong, who audits it, how it's logged, who pays the malpractice premium. None of that exists. Anthropic, OpenAI, and the AWS GovCloud / Vertex Healthcare layers are all selling the model side; the accountability stack remains a regulatory white space.
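What would that stack even record? A hypothetical sketch below; the field names are illustrative and don't come from any existing standard, but they cover the minimum an audit trail needs to answer "who saw what, which model version suggested it, and who signed off" after the fact.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DiagnosisAuditRecord:
    """Hypothetical audit-trail entry for one AI-suggested diagnosis."""
    encounter_id: str          # de-identified patient encounter
    model_id: str              # exact model and version behind the suggestion
    input_hash: str            # hash of the EMR snapshot the model actually saw
    suggested_diagnosis: str
    reasoning_trace_uri: str   # pointer to the logged reasoning, kept for audit
    reviewing_clinician: str   # who reviewed the suggestion
    clinician_action: str      # "accepted" | "modified" | "rejected"
    final_diagnosis: str
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```

Nothing in the record is technically exotic. The hard part is everything around it: retention rules, who can query it, and which regulator or insurer accepts it as evidence.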

If you ship medical-AI tooling, this study is the eval bar to clear: real cases, blinded panel, same data the human had. If you're not clearing that bar, your "outperforms doctors" claim is benchmarketing. If you're a builder watching the ecosystem, the open question to track isn't model accuracy; it's the accountability framework. Whoever ships an auditable diagnostic-AI deployment first (logged reasoning, traceable training data, a malpractice-ready insurance product) builds a moat the model labs can't build on their own. The clinical evidence is now ahead of the regulatory infrastructure. That gap is the next eighteen months of medical AI.