MedResearchBench: A Multi-Domain Benchmark for Evaluating AI Research Agents on Clinical Medical Research

Tan, S.; Tian, Z.

2026-03-31 | health informatics
medRxiv | doi: 10.64898/2026.03.30.26349749
The rapid advancement of AI research automation systems (including AI Scientist, data-to-paper, and Agent Laboratory) has demonstrated the potential for autonomous scientific discovery. However, existing benchmarks for evaluating these systems focus predominantly on fundamental sciences (machine learning, physics, chemistry), overlooking the unique challenges of medical clinical research: complex survey designs, inferential statistics with confounding control, adherence to reporting standards (STROBE, CONSORT), and the requirement for clinically actionable interpretation. We present MedResearchBench, the first benchmark specifically designed to evaluate AI systems on medical clinical research tasks. MedResearchBench comprises 16 tasks spanning 7 clinical domains (cardiovascular, oncology, mental health, metabolic, respiratory, neurology, infectious disease), built on publicly available datasets (the National Health and Nutrition Examination Survey [NHANES] and the Surveillance, Epidemiology, and End Results [SEER] program) with ground truth from 16 high-quality published papers (IF range: 2.3-51.0). Each task is evaluated along 6 medical-specific dimensions: statistical methodology, results accuracy, visualization quality, clinical interpretation, confounding sensitivity, and reporting compliance. We describe the benchmark design rationale, task construction methodology, paper selection criteria with anti-paper-mill filtering, and a detailed analysis of task characteristics including methodological diversity, evaluation dimension coverage, and difficulty stratification. To demonstrate benchmark executability, we evaluate an agentic data2paper pipeline on 3 pilot tasks spanning all three difficulty tiers, achieving scores of 72/100 (Tier 1, Cardio_000), 69/100 (Tier 2, Mental_000), and 75/100 (Tier 3, Metabolic_002), with a mean score of 72/100 (B-level). Survey-weighted methodology was correctly implemented across all tasks; primary limitations included covariate incompleteness and reference group misspecification. MedResearchBench addresses a critical gap in AI research evaluation and provides a standardized, community-extensible platform for assessing whether AI systems can conduct clinically sound, publication-quality medical research. All task materials are publicly available at https://github.com/TerryFYL/MedResearchBench.
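The abstract credits the pipeline with correctly implementing survey-weighted methodology on NHANES data. As an illustration of what that means at its simplest, the sketch below computes a weighted prevalence estimate in pure Python; the respondent data and weight values are invented for this example and are not taken from the paper.

```python
# Minimal sketch of survey-weighted estimation: an unweighted mean treats every
# respondent equally, while NHANES-style weights scale each respondent to the
# number of people they represent in the population. All values are hypothetical.
outcome = [1, 0, 1, 1, 0, 0, 1, 0]  # binary outcome per respondent
weights = [12000.0, 8500.0, 15000.0, 9000.0,
           11000.0, 7000.0, 13000.0, 10000.0]  # survey weights per respondent

# Weighted prevalence: sum(w_i * y_i) / sum(w_i)
prevalence = sum(w * y for w, y in zip(weights, outcome)) / sum(weights)
print(f"survey-weighted prevalence: {prevalence:.3f}")
```

A full analysis would also account for NHANES's stratified cluster design when computing variances, which is where real pipelines typically use dedicated survey packages rather than a plain weighted mean.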

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 19.0% (97 papers in training set; top 0.2%)
2. Patterns: 9.3% (70 papers in training set; top 0.1%)
3. Journal of the American Medical Informatics Association: 8.4% (61 papers in training set; top 0.3%)
4. Bioinformatics: 6.5% (1061 papers in training set; top 4%)
5. Nature Communications: 6.5% (4913 papers in training set; top 28%)
6. Scientific Reports: 4.9% (3102 papers in training set; top 22%)
[50% of probability mass above]
7. JCO Clinical Cancer Informatics: 4.0% (18 papers in training set; top 0.2%)
8. JAMIA Open: 1.9% (37 papers in training set; top 0.7%)
9. Scientific Data: 1.8% (174 papers in training set; top 1.0%)
10. PLOS ONE: 1.8% (4510 papers in training set; top 51%)
11. Journal of Biomedical Informatics: 1.7% (45 papers in training set; top 0.8%)
12. eLife: 1.7% (5422 papers in training set; top 41%)
13. Nature Methods: 1.7% (336 papers in training set; top 4%)
14. BMC Bioinformatics: 1.5% (383 papers in training set; top 5%)
15. The Lancet Digital Health: 1.4% (25 papers in training set; top 0.5%)
16. Med: 1.4% (38 papers in training set; top 0.4%)
17. Nature Computational Science: 1.4% (50 papers in training set; top 0.8%)
18. Nature Medicine: 0.9% (117 papers in training set; top 4%)
19. GigaScience: 0.9% (172 papers in training set; top 2%)
20. Nature Machine Intelligence: 0.9% (61 papers in training set; top 3%)
21. Computers in Biology and Medicine: 0.8% (120 papers in training set; top 4%)
22. Annals of Internal Medicine: 0.8% (27 papers in training set; top 0.8%)
23. iScience: 0.8% (1063 papers in training set; top 29%)
24. Nature Biomedical Engineering: 0.7% (42 papers in training set; top 2%)
25. Proceedings of the National Academy of Sciences: 0.7% (2130 papers in training set; top 47%)
26. Data in Brief: 0.7% (13 papers in training set; top 0.6%)
27. Briefings in Bioinformatics: 0.5% (326 papers in training set; top 8%)
28. European Journal of Epidemiology: 0.5% (40 papers in training set; top 1.0%)
29. Nature: 0.5% (575 papers in training set; top 17%)
30. International Journal of Medical Informatics: 0.5% (25 papers in training set; top 2%)