ReviewBench: An Extensible Framework for Benchmarking Human and AI Manuscript Review
Khalil, N. N.; Reed, T. J.; Ciccozzi, M. R.
Abstract
The volume of scientific manuscripts is rising faster than the available pool of expert reviewers, and AI tools are emerging as a possible response, ranging from frontier large language models applied directly to peer review to purpose-built multi-agent systems. Scalable, standardized benchmarks are needed to regularly evaluate how these tools compare to one another and to human reviewers. We present ReviewBench, an open-source, venue-agnostic framework that compares human and AI reviews across structure, alignment with a paper's major claims, impact, and critique category. We apply ReviewBench to 145,021 review comments from human reviewers, frontier large language models (GPT-5.2 and Gemini 3 Pro), and Reviewer3.com (R3), a multi-agent peer review system. The dataset spans papers in computer science (ICLR 2025, n = 1,000), social science (Nature Human Behaviour, n = 142), and life science (eLife, n = 1,000). Across disciplines, AI reviews are more structured and engage more directly with a paper's major claims, and R3 more often surfaces consequential comments, defined as comments capable of undermining those claims. When restricted to critical comments, however, human reviewers rank first in consequential rate on more individual papers than any AI source, despite a lower average. The per-reviewer distribution of this rate is bimodal, with peaks near 0% and 100%: many reviewers outperform AI on this metric, while the substantial fraction near 0% pulls the average down. Critique typing reveals systematic differences: humans emphasize contribution and clarity, whereas AI emphasizes validity, sufficiency, and transparency. Together, these findings argue against framing AI as a replacement for human review and instead support a complementary model in which AI scales technical verification of major claims while human judgment remains essential for evaluating contribution and shaping editorial decisions.
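
The per-reviewer consequential rate referenced in the abstract can be made concrete with a minimal sketch. The Python snippet below is illustrative only and not taken from the paper: the ReviewComment fields, identifiers, and the consequential_rates function are assumptions about how such a metric could be computed, namely the fraction of each reviewer's critical comments that could undermine a major claim.

    # Illustrative sketch (assumed structure, not the authors' implementation):
    # per-reviewer consequential rate over critical comments.
    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class ReviewComment:
        reviewer_id: str          # hypothetical reviewer identifier
        is_critical: bool         # comment criticizes the paper
        is_consequential: bool    # comment could undermine a major claim

    def consequential_rates(comments: list[ReviewComment]) -> dict[str, float]:
        """Fraction of each reviewer's critical comments that are consequential."""
        critical = defaultdict(int)
        consequential = defaultdict(int)
        for c in comments:
            if c.is_critical:
                critical[c.reviewer_id] += 1
                if c.is_consequential:
                    consequential[c.reviewer_id] += 1
        return {r: consequential[r] / critical[r] for r in critical}

    # Example: two reviewers near the two modes (100% and 0%) described in the abstract.
    comments = [
        ReviewComment("rev_a", True, True),
        ReviewComment("rev_a", True, True),
        ReviewComment("rev_b", True, False),
        ReviewComment("rev_b", True, False),
    ]
    print(consequential_rates(comments))  # {'rev_a': 1.0, 'rev_b': 0.0}

Under this reading, a bimodal distribution of these per-reviewer rates, with peaks near 0% and 100%, is consistent with many reviewers exceeding the AI sources while a sizable group near 0% lowers the human average.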