
ReviewBench: An Extensible Framework for Benchmarking Human and AI Manuscript Review

Khalil, N. N.; Reed, T. J.; Ciccozzi, M. R.

2026-04-20 · Scientific Communication and Education
bioRxiv · doi: 10.64898/2026.04.17.719279

Abstract

The volume of scientific manuscripts is rising faster than the available pool of expert reviewers, and AI tools are emerging as a possible response, ranging from frontier large language models applied directly to peer review to purpose-built multi-agent systems. Scalable, standardized benchmarks are needed to regularly evaluate how these tools compare to one another and to human reviewers. We present ReviewBench, an open-source, venue-agnostic framework that compares human and AI reviews across structure, alignment with a paper's major claims, impact, and critique category. We apply ReviewBench to 145,021 review comments from human reviewers, frontier large language models (GPT-5.2 and Gemini 3 Pro), and Reviewer3.com (R3), a multi-agent peer review system. The dataset spans papers in computer science (ICLR 2025, n = 1,000), social science (Nature Human Behaviour, n = 142), and life science (eLife, n = 1,000). Across disciplines, AI reviews are more structured and engage more directly with a paper's major claims, with R3 more often surfacing consequential comments, defined as comments capable of undermining those claims. When restricting to critical comments, however, human reviewers rank first on consequential rate on more individual papers than any AI source, despite a lower average. We identify a bimodal reviewer distribution with peaks near 0% and 100%: many reviewers outperform AI on this metric, but the substantial fraction of reviewers near 0% pulls the average down. Critique typing reveals systematic differences: humans emphasize contribution and clarity, while AI emphasizes validity, sufficiency, and transparency. Together, these findings argue against framing AI as a replacement for human review and instead support a complementary model in which AI scales technical verification of major claims while human judgment remains essential for evaluating contribution and shaping editorial decisions.
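The mean-versus-per-paper-rank finding above hinges on a statistical point: a bimodal distribution can have a lower mean than a unimodal one even when many of its members sit near 100%. A minimal sketch with illustrative numbers (not drawn from the ReviewBench dataset; all rates here are hypothetical):

```python
# Hypothetical per-reviewer "consequential rates": the fraction of a
# reviewer's critical comments capable of undermining a paper's major claims.
# Human rates are bimodal (peaks near 0% and 100%); AI rates are clustered.
human_rates = [0.0, 0.0, 0.0, 0.05, 0.95, 1.0, 1.0, 1.0]
ai_rates = [0.55, 0.60, 0.62, 0.58, 0.65, 0.60, 0.57, 0.63]

human_mean = sum(human_rates) / len(human_rates)
ai_mean = sum(ai_rates) / len(ai_rates)

# Half of the human reviewers individually beat the AI average,
# yet the cluster near 0% drags the human mean below it.
humans_above_ai_mean = sum(r > ai_mean for r in human_rates)

print(f"human mean: {human_mean:.2f}")  # 0.50
print(f"ai mean:    {ai_mean:.2f}")     # 0.60
print(f"humans above AI mean: {humans_above_ai_mean} of {len(human_rates)}")
```

This is the shape of the abstract's observation: the human *average* trails, while on a per-reviewer (and per-paper) basis humans frequently rank first.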

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Nature Neuroscience: 22.8% (216 papers in training set, Top 0.1%)
2. Nature Biotechnology: 18.9% (147 papers in training set, Top 0.2%)
3. Nature: 7.3% (575 papers in training set, Top 4%)
4. Nature Human Behaviour: 6.9% (85 papers in training set, Top 0.3%)
   (50% of probability mass above)
5. PLOS Biology: 3.6% (408 papers in training set, Top 3%)
6. eLife: 3.6% (5422 papers in training set, Top 24%)
7. eNeuro: 3.1% (389 papers in training set, Top 3%)
8. PLOS ONE: 2.8% (4510 papers in training set, Top 44%)
9. Nature Methods: 2.8% (336 papers in training set, Top 3%)
10. PLOS Computational Biology: 2.6% (1633 papers in training set, Top 12%)
11. Nature Genetics: 1.9% (240 papers in training set, Top 4%)
12. Philosophical Transactions of the Royal Society B: 1.8% (51 papers in training set, Top 3%)
13. Bioinformatics: 1.7% (1061 papers in training set, Top 7%)
14. Science: 1.7% (429 papers in training set, Top 14%)
15. Journal of Cell Biology: 1.7% (333 papers in training set, Top 2%)
16. Genome Biology: 1.0% (555 papers in training set, Top 6%)
17. Scientific Reports: 1.0% (3102 papers in training set, Top 69%)
18. Patterns: 1.0% (70 papers in training set, Top 2%)
19. Proceedings of the National Academy of Sciences: 0.9% (2130 papers in training set, Top 41%)
20. Nature Machine Intelligence: 0.8% (61 papers in training set, Top 3%)
21. GigaScience: 0.8% (172 papers in training set, Top 3%)
22. Journal of the American Medical Informatics Association: 0.8% (61 papers in training set, Top 2%)
23. Nature Communications: 0.8% (4913 papers in training set, Top 61%)
24. Molecular Systems Biology: 0.8% (142 papers in training set, Top 2%)
25. Cell Systems: 0.8% (167 papers in training set, Top 12%)
26. Nature Medicine: 0.8% (117 papers in training set, Top 5%)
27. Computers in Biology and Medicine: 0.7% (120 papers in training set, Top 5%)
28. BMC Bioinformatics: 0.7% (383 papers in training set, Top 7%)
29. BMC Medicine: 0.7% (163 papers in training set, Top 7%)
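The "top 4 journals account for 50%" cutoff can be verified by a cumulative sum over the listed probabilities. A small sketch using the top-10 values exactly as shown above:

```python
# Predicted probabilities (percent) for the top 10 matching journals,
# taken from the ranking above.
probs = [22.8, 18.9, 7.3, 6.9, 3.6, 3.6, 3.1, 2.8, 2.8, 2.6]

# Walk down the ranking until the cumulative mass first reaches 50%.
cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        break

print(f"50% of probability mass reached at rank {rank} "
      f"(cumulative {cumulative:.1f}%)")
```

The cumulative mass crosses 50% at rank 4 (22.8 + 18.9 + 7.3 = 49.0%, then + 6.9 = 55.9%), matching the stated cutoff.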