MedResearchBench: A Multi-Domain Benchmark for Evaluating AI Research Agents on Clinical Medical Research

Tan, S.; Tian, Z.

2026-03-31 | health informatics
medRxiv | doi: 10.64898/2026.03.30.26349749
The rapid advancement of AI research automation systems (including AI Scientist, data-to-paper, and Agent Laboratory) has demonstrated the potential for autonomous scientific discovery. However, existing benchmarks for evaluating these systems focus predominantly on fundamental sciences (machine learning, physics, chemistry), overlooking the unique challenges of medical clinical research: complex survey designs, inferential statistics with confounding control, adherence to reporting standards (STROBE, CONSORT), and the requirement for clinically actionable interpretation. We present MedResearchBench, the first benchmark specifically designed to evaluate AI systems on medical clinical research tasks. MedResearchBench comprises 16 tasks spanning 7 clinical domains (cardiovascular, oncology, mental health, metabolic, respiratory, neurology, infectious disease), built on publicly available datasets (the National Health and Nutrition Examination Survey [NHANES] and the Surveillance, Epidemiology, and End Results [SEER] program) with ground truth from 16 high-quality published papers (IF range: 2.3-51.0). Each task is evaluated along 6 medical-specific dimensions: statistical methodology, results accuracy, visualization quality, clinical interpretation, confounding sensitivity, and reporting compliance. We describe the benchmark design rationale, task construction methodology, paper selection criteria with anti-paper-mill filtering, and a detailed analysis of task characteristics including methodological diversity, evaluation dimension coverage, and difficulty stratification. To demonstrate benchmark executability, we evaluate an agentic data2paper pipeline on 3 pilot tasks spanning all three difficulty tiers, achieving scores of 72/100 (Tier 1, Cardio_000), 69/100 (Tier 2, Mental_000), and 75/100 (Tier 3, Metabolic_002), with a mean score of 72/100 (B-level). Survey-weighted methodology was correctly implemented across all tasks; primary limitations included covariate incompleteness and reference group misspecification. MedResearchBench addresses a critical gap in AI research evaluation and provides a standardized, community-extensible platform for assessing whether AI systems can conduct clinically sound, publication-quality medical research. All task materials are publicly available at https://github.com/TerryFYL/MedResearchBench.
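The abstract credits the pipeline with correctly implementing survey-weighted methodology on NHANES data. As an illustration of what that means at its simplest, the sketch below computes a weighted prevalence estimate in pure Python; the respondent data and weight values are invented for this example and are not taken from the paper.

```python
# Minimal sketch of survey-weighted estimation: an unweighted mean treats every
# respondent equally, while NHANES-style weights scale each respondent to the
# number of people they represent in the population. All values are hypothetical.
outcome = [1, 0, 1, 1, 0, 0, 1, 0]  # binary outcome per respondent
weights = [12000.0, 8500.0, 15000.0, 9000.0,
           11000.0, 7000.0, 13000.0, 10000.0]  # survey weights per respondent

# Weighted prevalence: sum(w_i * y_i) / sum(w_i)
prevalence = sum(w * y for w, y in zip(weights, outcome)) / sum(weights)
print(f"survey-weighted prevalence: {prevalence:.3f}")
```

A full analysis would also account for NHANES's stratified cluster design when computing variances, which is where real pipelines typically use dedicated survey packages rather than a plain weighted mean.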

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 19.0% (97 papers in training set; top 0.2%)
2. Patterns: 9.3% (70 papers in training set; top 0.1%)
3. Journal of the American Medical Informatics Association: 8.4% (61 papers in training set; top 0.3%)
4. Bioinformatics: 6.5% (1061 papers in training set; top 4%)
5. Nature Communications: 6.5% (4913 papers in training set; top 28%)
6. Scientific Reports: 4.9% (3102 papers in training set; top 22%)
[50% of probability mass above]
7. JCO Clinical Cancer Informatics: 4.0% (18 papers in training set; top 0.2%)
8. JAMIA Open: 1.9% (37 papers in training set; top 0.7%)
9. Scientific Data: 1.8% (174 papers in training set; top 1.0%)
10. PLOS ONE: 1.8% (4510 papers in training set; top 51%)
11. Journal of Biomedical Informatics: 1.7% (45 papers in training set; top 0.8%)
12. eLife: 1.7% (5422 papers in training set; top 41%)
13. Nature Methods: 1.7% (336 papers in training set; top 4%)
14. BMC Bioinformatics: 1.5% (383 papers in training set; top 5%)
15. The Lancet Digital Health: 1.4% (25 papers in training set; top 0.5%)
16. Med: 1.4% (38 papers in training set; top 0.4%)
17. Nature Computational Science: 1.4% (50 papers in training set; top 0.8%)
18. Nature Medicine: 0.9% (117 papers in training set; top 4%)
19. GigaScience: 0.9% (172 papers in training set; top 2%)
20. Nature Machine Intelligence: 0.9% (61 papers in training set; top 3%)
21. Computers in Biology and Medicine: 0.8% (120 papers in training set; top 4%)
22. Annals of Internal Medicine: 0.8% (27 papers in training set; top 0.8%)
23. iScience: 0.8% (1063 papers in training set; top 29%)
24. Nature Biomedical Engineering: 0.7% (42 papers in training set; top 2%)
25. Proceedings of the National Academy of Sciences: 0.7% (2130 papers in training set; top 47%)
26. Data in Brief: 0.7% (13 papers in training set; top 0.6%)
27. Briefings in Bioinformatics: 0.5% (326 papers in training set; top 8%)
28. European Journal of Epidemiology: 0.5% (40 papers in training set; top 1.0%)
29. Nature: 0.5% (575 papers in training set; top 17%)
30. International Journal of Medical Informatics: 0.5% (25 papers in training set; top 2%)