OpenAI Introduces GeneBench-Pro, a Computational Biology Benchmark for AI Agents
What happened
OpenAI released GeneBench-Pro, a new benchmark evaluating AI agents on real computational biology research tasks. It comprises 129 problems spanning genomics, quantitative biology, and translational medicine, each involving noisy real-world datasets and high-stakes judgment calls.
Context and impact
Human experts estimate that each task takes 20–40 hours to complete. Even the top-scoring model, GPT-5.6 Sol at 31.5%, falls far short of expert-level performance in this domain. OpenAI projects that, at current improvement rates, the benchmark will be saturated by the end of 2026.
Details
- 129 tasks across genomics, quantitative biology, and translational medicine
- GPT-5.6 Sol: 31.5% (top performer)
- Claude Opus 4.8: 16%
- Gemini 3.5 Flash: 8.1%
- Each task estimated at 20–40 hours of human expert work
- OpenAI projects benchmark saturation by the end of 2026 at current improvement rates
Source: OpenAI