OpenAI's new LifeSciBench grades AI on real biology research, and the best model passes only 36%, Zubnet AI News

OpenAI has released LifeSciBench, a benchmark that tries to measure how well AI actually helps with real-world life-science research, and the headline result is humbling: the strongest model tested passes only 36.1% of the tasks. It reads as a deliberate reality check in a week thick with claims about AI matching doctors and helping with chemistry research.

Rather than a quiz of facts, LifeSciBench was built by 173 PhD scientists from biotech and pharmaceutical research, who wrote 750 tasks spanning seven research workflows, from handling evidence to running analysis to communicating results. Each task is graded against a detailed rubric whose criteria, 19,020 in all or about 25 per task, score the specific claims, calculations, decisions, and justifications a good answer needs to contain. Nearly four in five of the tasks require several reasoning or decision steps, so the test grades judgment rather than recall.
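To make the idea of rubric grading concrete, here is a minimal sketch of how an answer might be scored against a list of criteria. It is an illustration only: the article does not say how LifeSciBench weights its criteria or where it sets the pass mark, so the equal weighting, the `Criterion` type, the `rubric_score` and `passes` helpers, and the 0.8 threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what a good answer must contain
    met: bool         # the grader's verdict for this answer


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied (equal weighting is an assumption)."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def passes(criteria: list[Criterion], threshold: float = 0.8) -> bool:
    """Pass/fail against a hypothetical cutoff; the real one is not stated."""
    return rubric_score(criteria) >= threshold


# A made-up four-item rubric, far shorter than the ~25 criteria per task.
example = [
    Criterion("states the correct effect direction", True),
    Criterion("reports the calculation with units", True),
    Criterion("justifies the choice of control", False),
    Criterion("flags the main limitation of the data", True),
]

print(rubric_score(example))  # 0.75
print(passes(example))        # False at the assumed 0.8 cutoff
```

Under this kind of scheme an answer can be mostly right and still fail, which is one way a benchmark can end up grading judgment rather than recall.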

Against that bar, the models struggle. OpenAI's own domain-specialized model, GPT-Rosalind, led the field, posting the best per-task score on 386 of the 750 tasks and lifting the overall pass rate from GPT-5.5's 25.7% to 36.1%. Even so, that top score means the best system still fails nearly 64% of what expert scientists would consider solid research work. A benchmark on which its maker's own best model tops out near a third is, in its way, a useful admission about where the technology actually stands.
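The figures reported for the benchmark can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this article; the variable names are ours, and nothing here reflects how OpenAI computes its own statistics.

```python
# Figures as reported in the article.
TASKS = 750
CRITERIA = 19_020
ROSALIND_BEST_ON = 386        # tasks where GPT-Rosalind had the top score
PASS_RATE_ROSALIND = 36.1     # percent
PASS_RATE_GPT55 = 25.7        # percent

criteria_per_task = CRITERIA / TASKS
share_best = ROSALIND_BEST_ON / TASKS * 100
fail_rate = 100 - PASS_RATE_ROSALIND
gain_points = PASS_RATE_ROSALIND - PASS_RATE_GPT55

print(f"criteria per task: {criteria_per_task:.2f}")        # 25.36
print(f"tasks where Rosalind led: {share_best:.1f}%")       # 51.5%
print(f"tasks failed by best model: {fail_rate:.1f}%")      # 63.9%
print(f"gain over GPT-5.5: {gain_points:.1f} points")       # 10.4 points
```

So the leading model had the top score on just over half the tasks, gained a little more than ten percentage points over GPT-5.5, and still fell short on roughly 64% of the work.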

The timing is pointed. This same week brought a model that matched primary-care doctors on managing disease, another that helped improve a chemistry reaction, and an image company announcing a medical scanner, all of which invite the read that AI has arrived in the lab and the clinic. LifeSciBench is the counterweight from inside the same industry: when you grade the work the way working scientists do, against what a careful answer must actually contain, today's best models clear about a third of it. The capability is real and climbing, but the distance left to expert level is exactly the part the demonstrations tend to leave out.
