A new Stanford-led study published in Nature Medicine, with lead authors Ethan Goh and Robert Gallo and senior authors Jonathan Chen at Stanford and Adam Rodman at Harvard, ran 70 U.S.-licensed physicians against a custom GPT-4 system across 254 simulated clinical case vignettes. The headline numbers are striking: physicians with conventional reference tools scored 75% on diagnosis and management, AI used as a first opinion lifted them to 85%, AI as a second opinion got them to 82%, and the AI working alone scored 87%. Clinician openness to using the tool jumped from 91% before the trial to 99% after. Coverage is treating this as "chatbots outperform doctors." Read the methodology section instead.

The study used vignettes (structured case descriptions written for evaluation purposes), not actual patient encounters. The authors are explicit about why: vignettes are controllable, scorable, and reproducible. They are also, in their own words, "less representative of real practice." A vignette gives the model and the physician the same clean text input: no missing data, no ambiguous patient affect, no time pressure, no chart noise, no follow-up questions that have to be asked at the right moment. The physicians in the trial worked with internet searches and medical references, but without the rest of the clinical toolkit: the physical exam, the longitudinal relationship with the patient, and the workflow cues that let a doctor recognize when something is off in a way text cannot capture. AI getting 87% on a vignette is not the same thing as AI getting 87% on a real clinic visit, and the authors know this.

Read alongside the Nature Medicine editorial published the same week, which argued that "evidence that AI tools create value for patients, providers, or health systems remains scarce" and called for prospective evaluation against agreed-upon benchmarks, the Goh-Rodman paper is exactly the type of work the editorial was talking about. It is rigorous, it produces a useful directional signal, and it does not establish the kind of evidence that justifies broad clinical deployment. The 9.9-percentage-point accuracy lift from AI as a first opinion is meaningful as a hypothesis to test in a real-world prospective trial. It is not yet meaningful as a basis for telling hospital systems to integrate the tool. The 10% system failure rate the authors report, the non-determinism they flag, and the gap between vignette difficulty and live-encounter difficulty are all reasons the next study has to look different from this one.
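
To put "hypothesis to test" in scale terms: a rough, back-of-the-envelope power calculation (my sketch, not from the paper; it assumes the roughly 75% vs 85% vignette effect would carry over unchanged to live encounters, which is precisely the untested claim) suggests what a prospective trial would need just to detect that lift.

```python
# Sketch: sample size for a two-arm prospective trial, assuming the
# vignette-derived accuracy difference (~75% vs ~85%) holds in clinic.
# The arm accuracies are illustrative assumptions, not results from the study.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_usual_care, p_ai_first_opinion = 0.75, 0.85                  # assumed accuracy per arm
h = proportion_effectsize(p_ai_first_opinion, p_usual_care)    # Cohen's h effect size

n_per_arm = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"Cohen's h = {h:.2f}; about {n_per_arm:.0f} encounters per arm for 80% power")
# -> about 124 encounters per arm, i.e. a few hundred adjudicated real cases
```

Even under that optimistic carry-over assumption, the trial is a few hundred real encounters with adjudicated diagnoses, plus all the workflow integration, failure handling, and time-to-diagnosis measurement that the vignette format never had to deal with.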

For builders working on clinical AI, the practical reading is that vignette studies, retrospective benchmark wins, and openness surveys are the level of evidence the field is currently producing, and it is not enough. The Stanford team is doing high-quality work, and their results are a credible argument that LLM second opinions could improve diagnostic accuracy in some workflows. What is still missing is the prospective trial that puts the same system into a real clinic, with real patients, real time pressure, real workflow integration, and a success metric tied to patient outcomes rather than vignette scoring. Nature Medicine's editors are right that the field has been declaring victory before generating that evidence, and this study, despite its strong design, is part of an evidence base that is still pre-deployment-grade. The next studies that matter are the ones running in actual hospitals, measuring actual changes in diagnostic accuracy and time-to-correct-diagnosis at the point of care.