OpenAI Built a Biology Benchmark Where Winning Means Failing 70% of the Time

The most interesting number in OpenAI's new GeneBench-Pro benchmark is not 31.5%. It's the nearly 70% of problems that even the best model still fails.

OpenAI released GeneBench-Pro on June 30, a successor to its original GeneBench, built to test whether AI agents can do the kind of messy, judgment-heavy work that makes computational biology hard. Not "what is a p-value" hard. The benchmark presents an agent with a noisy dataset, a brief experimental context, and a question, then asks it to figure out which analysis the data can actually support, revise assumptions when early diagnostics go sideways, and know when its original plan needs to be scrapped. OpenAI calls this skill "research taste." That phrasing is doing a lot of work, and I think they mean it seriously.

The benchmark has 129 problems spanning genomics, quantitative biology, and translational medicine. Every problem is synthetic, generated from a known causal structure, so answers can be graded against ground truth without the rubric variability that plagues most long-horizon science evaluations. OpenAI sent 82 of the 129 problems to external domain experts, including postdocs and professors, to verify they reflected realistic research and had identifiable correct answers. Ten representative questions and a 50-question subset are open for third-party use.
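To make the "known causal structure" idea concrete, here is a toy sketch of how a synthetic problem can be graded against ground truth. This is not OpenAI's generator or grader. Every name, coefficient, and tolerance below is my own assumption, picked only to illustrate the principle: a hidden confounder drives both the exposure and the outcome, so the obvious analysis yields a confident wrong answer while the adjusted one recovers the planted effect.

```python
import random

# Hypothetical ground truth planted by the problem author (an assumption for
# this sketch, not a value from GeneBench-Pro).
TRUE_EFFECT = 2.0


def make_problem(n=5000, seed=0):
    """Simulate (x, z, y) rows from a known causal structure: z -> x, z -> y, x -> y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # confounder
        x = 1.5 * z + rng.gauss(0, 1)            # exposure depends on z
        y = TRUE_EFFECT * x + 3.0 * z + rng.gauss(0, 1)
        rows.append((x, z, y))
    return rows


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((i - mean_a) * (j - mean_b) for i, j in zip(a, b)) / len(a)


def naive_estimate(rows):
    """Slope of y on x alone, ignoring the confounder."""
    x, _, y = zip(*rows)
    return cov(x, y) / cov(x, x)


def adjusted_estimate(rows):
    """Coefficient on x from a two-predictor least-squares fit of y on x and z."""
    x, z, y = zip(*rows)
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


def grade(estimate, truth=TRUE_EFFECT, tol=0.1):
    """Pass if the estimate lands within an assumed tolerance of the planted effect."""
    return abs(estimate - truth) <= tol


if __name__ == "__main__":
    rows = make_problem()
    print("naive passes:   ", grade(naive_estimate(rows)))
    print("adjusted passes:", grade(adjusted_estimate(rows)))
```

Because the effect is planted, grading reduces to a numeric comparison, with no rubric to argue over. The judgment being tested lives entirely in choosing which estimate to compute.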

GPT-5.6 Sol Pro hit a 31.5% pass rate at maximum reasoning. GPT-5.6 Sol without Pro mode: 28.7%. The best non-OpenAI result, from Anthropic's Claude Opus 4.8, was 16.0%. Google's Gemini 3.5 Flash came in at 8.1%. For context, on the original GeneBench, GPT-5 scored below 5%.

There's real progress there. But the benchmark's designers clearly don't think the story is the scores. They built a test specifically around the class of problems where current AI fails not because it lacks knowledge, but because it lacks judgment about how to apply that knowledge. The comparison they keep reaching for is a scientist who has the expertise but still has to decide whether a pattern is signal or noise, whether the chosen estimand matches what the data can actually estimate, and when to abandon an analysis path that looked fine at the start.

Here's my read: this is the most honest framing I've seen from a major lab about where their models actually sit in scientific research. A 31.5% pass rate on a benchmark designed by the company running the best model is a strange thing to ship, unless the company is trying to say something. I think they are. The AI-will-accelerate-drug-discovery pitch has been running for years. GeneBench-Pro is a quiet admission that the piece currently missing isn't compute or context window. It's the iterative judgment that sits between running an analysis and trusting a result.

The choice to make the benchmark synthetic, rather than pulling problems from published literature, matters for two reasons. It eliminates data contamination concerns, which are brutal on biology benchmarks because so much genomic methodology is thoroughly documented online. It also means the difficulty can be tuned deliberately. That 60% of problems sit below a 20% pass rate, even for the strongest models on the original GeneBench, isn't an accident of selection. It's a design choice that says: here is where the ceiling is.

What I keep coming back to is that phrase, "research taste." It names something real. The ability to notice that your data can't support the question you came in asking, and redirect before you produce a confident wrong answer, is genuinely hard to evaluate and genuinely important. The fact that OpenAI tried to build a formal test for it, and then scored below a third on their own test, is either a strange kind of marketing or a useful act of honesty about the gap between what current models can do and what scientific practice actually requires. I'm inclined toward the latter.
