
Citation Hallucination Determines Success: An Empirical Comparison of Six Medical AI Research Systems

Shi, X.; Tian, Z.; Tan, S.; Wang, X.

2026-04-04 · health informatics
doi: 10.64898/2026.04.02.26350091 · medRxiv

Large language model (LLM) systems can now generate complete research manuscripts, yet their reliability in clinical medicine, where citation accuracy and reporting standards carry direct consequences, has not been systematically assessed. We introduce MedResearchBench, a benchmark of three clinical epidemiology tasks built on NHANES data, and use it to evaluate six AI research systems across six quality dimensions. Evaluation combines programmatic citation verification, rule-based reporting compliance checks, and multi-model LLM judging, providing a more discriminative assessment than conventional single-judge approaches. Citation integrity emerged as the decisive quality dimension. Hallucination rates ranged from 2.9% to 36.8% across systems, and a hard-rule threshold on per-task citation scores capped four of the six systems' total scores at the penalty ceiling. Adding a multi-agent citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9 and raised its weighted total from 68.9 to 81.8. Strikingly, a single-model evaluation ranked this system last (55.5), while our three-tier framework ranked it first (81.8), a complete reversal that exposes the limitations of subjective LLM-only evaluation. These results suggest that programmatic citation verification should be a core metric in future evaluations of AI scientific writing systems, and that multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship.
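The abstract's "programmatic citation verification" can be sketched as a pure function: for each reference, resolve its DOI (in practice via a registry lookup such as Crossref) and fuzzy-match the claimed title against the resolved one; the hallucination rate is the fraction that fail. The paper does not publish its verifier, so the function names, matching threshold, and toy registry below are illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher

def title_match(claimed: str, resolved: str, threshold: float = 0.9) -> bool:
    """Fuzzy-compare a claimed citation title with the title its DOI resolves to."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(claimed), norm(resolved)).ratio() >= threshold

def hallucination_rate(references, resolve) -> float:
    """Fraction of references whose DOI fails to resolve or resolves to a different work."""
    bad = 0
    for ref in references:
        resolved_title = resolve(ref["doi"])  # in practice: a Crossref/DataCite lookup
        if resolved_title is None or not title_match(ref["title"], resolved_title):
            bad += 1
    return bad / len(references)

# Toy registry standing in for a live DOI lookup (hypothetical data).
registry = {
    "10.1000/real1": "Deep learning for clinical risk prediction",
    "10.1000/real2": "Reporting standards in epidemiology",
}
refs = [
    {"doi": "10.1000/real1", "title": "Deep Learning for Clinical Risk Prediction"},
    {"doi": "10.1000/fake9", "title": "A study that does not exist"},
]
print(hallucination_rate(refs, registry.get))  # → 0.5
```

A real pipeline would also check author lists and publication years, and a repair stage (as in the paper's multi-agent variant) would replace or drop the failing entries rather than merely count them.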

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank  Journal                                                   Papers in training set  Percentile  Probability
1     npj Digital Medicine                                      97                      Top 0.2%    19.0%
2     Journal of the American Medical Informatics Association   61                      Top 0.2%    12.9%
3     Scientific Reports                                        3102                    Top 22%     4.9%
4     Philosophical Transactions of the Royal Society B         51                      Top 0.7%    4.9%
5     PLOS Digital Health                                       91                      Top 0.9%    2.8%
6     Bioinformatics                                            1061                    Top 6%      2.8%
7     Patterns                                                  70                      Top 0.4%    2.7%
----- 50% of probability mass above this line -----
8     Nature Communications                                     4913                    Top 45%     2.5%
9     PLOS ONE                                                  4510                    Top 46%     2.4%
10    Journal of Biomedical Informatics                         45                      Top 0.6%    2.1%
11    Annals of Internal Medicine                               27                      Top 0.3%    2.1%
12    JCO Clinical Cancer Informatics                           18                      Top 0.4%    1.9%
13    The Lancet Digital Health                                 25                      Top 0.3%    1.8%
14    Nature Medicine                                           117                     Top 2%      1.7%
15    Nature Human Behaviour                                    85                      Top 2%      1.7%
16    Computers in Biology and Medicine                         120                     Top 2%      1.5%
17    International Journal of Medical Informatics              25                      Top 1%      1.4%
18    BMC Bioinformatics                                        383                     Top 5%      1.2%
19    JAMIA Open                                                37                      Top 1%      1.2%
20    Med                                                       38                      Top 0.5%    1.0%
21    Proceedings of the National Academy of Sciences           2130                    Top 39%     1.0%
22    European Journal of Epidemiology                          40                      Top 0.6%    0.9%
23    BMC Medicine                                              163                     Top 6%      0.9%
24    Nature Machine Intelligence                               61                      Top 3%      0.9%
25    Journal of Medical Internet Research                      85                      Top 4%      0.8%
26    Artificial Intelligence in Medicine                       15                      Top 0.6%    0.8%
27    eLife                                                     5422                    Top 57%     0.8%
28    iScience                                                  1063                    Top 31%     0.8%
29    BMJ Health & Care Informatics                             13                      Top 0.9%    0.8%
30    GENETICS                                                  189                     Top 2%      0.7%
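The claim that the top 7 journals cover 50% of the predicted probability mass follows directly from the listed percentages; a short check finds the smallest prefix whose cumulative sum reaches 50% (the small tolerance guards against floating-point rounding):

```python
from itertools import accumulate

# Predicted probabilities (%) for the top-ranked journals, copied from the table above.
probs = [19.0, 12.9, 4.9, 4.9, 2.8, 2.8, 2.7, 2.5, 2.4, 2.1]

# Smallest k whose cumulative mass reaches 50%.
cum = list(accumulate(probs))
k = next(i + 1 for i, c in enumerate(cum) if c >= 50.0 - 1e-9)
print(k)  # → 7
```

Here 19.0 + 12.9 + 4.9 + 4.9 + 2.8 + 2.8 + 2.7 = 50.0 exactly, which is why the divider sits immediately after rank 7.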