BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research

Qu, Y.; Lu, Y.; Tu, X.; Zhang, S.; She, T.; Shaw, A. G.; Shih, J.-H.; Zhao, B.; Shen, M.; Yang, H.; Yan, J.; Zhang, R.; Wu, X.; Li, T.; Cong, L.; Hu, X.; Jiang, Y.; Dong, J.; Peng, T.; Leskovec, J.; Huang, K.

2026-05-14 bioinformatics
10.64898/2026.05.12.724604 bioRxiv

LLM agents now perform real biomedical research, but evaluating them rigorously is hard. Outcome-only benchmarks fail in two ways. First, a correct final answer can come from memorization, reward hacking, or wrong reasoning that produces the right number by chance. Second, valid alternative analyses are marked wrong simply because they differ from the reference. We introduce BiomniBench, a process-level evaluation framework that scores the full agent trajectory against expert-designed, task-specific rubrics. Our first release, BiomniBench-DA, contains 100 data-analysis tasks across 17 task types, 5 disease areas, and a general-biology category, each based on a paper from journals such as Nature, Cell, and Science and co-developed with an original author or a domain expert. Benchmarking frontier and open-weight models across four agent harnesses reveals three findings. Frontier and open-weight bases cluster within a few points of each other, with substantial headroom for all models. The agent harness shifts scores by more than the gap between successive model generations. Agents reliably ground claims in real sources yet consistently fall short on method selection, biological interpretation, and scientific reasoning. BiomniBench is the first process-level benchmark for LLM agents in biomedical research, providing the dimension-level diagnostics that outcome scoring cannot.

Dataset: huggingface.co/datasets/phylobio/BiomniBench-DA

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Methods: 336 papers in training set, Top 0.1%, 26.2%
2. Cell Systems: 167 papers in training set, Top 2%, 6.5%
3. Nature: 575 papers in training set, Top 4%, 6.5%
4. Nature Biotechnology: 147 papers in training set, Top 1%, 6.4%
5. Nature Communications: 4913 papers in training set, Top 29%, 6.4%

50% of probability mass above

6. Bioinformatics: 1061 papers in training set, Top 4%, 4.9%
7. Genome Biology: 555 papers in training set, Top 2%, 4.4%
8. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 2.5%
9. Science: 429 papers in training set, Top 11%, 2.5%
10. Nature Genetics: 240 papers in training set, Top 3%, 2.1%
11. Nature Machine Intelligence: 61 papers in training set, Top 2%, 1.9%
12. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 31%, 1.8%
13. Patterns: 70 papers in training set, Top 1%, 1.5%
14. Scientific Reports: 3102 papers in training set, Top 61%, 1.5%
15. PLOS Computational Biology: 1633 papers in training set, Top 18%, 1.3%
16. BMC Bioinformatics: 383 papers in training set, Top 5%, 1.3%
17. Nucleic Acids Research: 1128 papers in training set, Top 13%, 1.3%
18. Genome Medicine: 154 papers in training set, Top 5%, 1.3%
19. GigaScience: 172 papers in training set, Top 2%, 1.3%
20. Molecular Systems Biology: 142 papers in training set, Top 1%, 1.0%
21. Cell Genomics: 162 papers in training set, Top 6%, 0.8%
22. eLife: 5422 papers in training set, Top 55%, 0.8%
23. Genome Research: 409 papers in training set, Top 4%, 0.8%
24. Nature Medicine: 117 papers in training set, Top 5%, 0.8%
25. PLOS ONE: 4510 papers in training set, Top 67%, 0.8%
26. npj Digital Medicine: 97 papers in training set, Top 3%, 0.8%
27. Bioinformatics Advances: 184 papers in training set, Top 5%, 0.7%
28. Scientific Data: 174 papers in training set, Top 3%, 0.7%
29. Computers in Biology and Medicine: 120 papers in training set, Top 5%, 0.7%
30. Nature Computational Science: 50 papers in training set, Top 2%, 0.7%
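
The position of the 50% marker in the list above can be checked with a cumulative sum over the listed probabilities. The sketch below is only illustrative: it assumes the marker is placed at the smallest rank whose running total reaches 50%, which the page does not state explicitly, and it uses the rounded percentages as displayed. The function name `smallest_top_k` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-10, copied from the list above.
# These are the rounded values shown on the page, not the underlying model outputs.
probs = [26.2, 6.5, 6.5, 6.4, 6.4, 4.9, 4.4, 2.5, 2.5, 2.1]


def smallest_top_k(probabilities, threshold):
    """Return the smallest k whose first k entries sum to at least `threshold`.

    Returns None if the threshold is never reached. (Hypothetical helper;
    the cut-off rule itself is an assumption about how the page works.)
    """
    for k, running_total in enumerate(accumulate(probabilities), start=1):
        # Small tolerance guards against floating-point round-off.
        if running_total >= threshold - 1e-9:
            return k
    return None


k = smallest_top_k(probs, 50.0)
top_k_mass = sum(probs[:k])
print(f"top {k} journals cover about {top_k_mass:.1f}% of the probability mass")
```

With the displayed values, the first four journals sum to 45.6% and the first five to 52.0%, which is consistent with the marker sitting after rank 5.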