
Diagnostic Accuracy of Large Language Models for Rare Diseases: A Systematic Review and Meta-Analysis

Nguyen, M.-H.; Yang, C.-T.; Cassini, T. A.; Ma, F.; Hamid, R.; Bastarache, L.; Peterson, J. F.; Xu, H.; Li, L.; Ma, S.; Shyr, C.

2026-03-27 · genetic and genomic medicine
doi: 10.64898/2026.03.26.26349194 · medRxiv

Background: Large language models (LLMs) have been evaluated as tools to assist rare disease diagnosis, yet evidence on their accuracy remains fragmented. We conducted a systematic review and meta-analysis to synthesize the available evidence on the diagnostic performance of LLMs, identify sources of heterogeneity, and evaluate the current evidence base for clinical translation.

Methods: We searched PubMed, Embase, Web of Science, Cochrane Library, arXiv, and medRxiv (January 2020 to February 2026). Full-text articles and preprints were considered for inclusion. Eligible studies applied LLM-based systems to generate differential diagnoses for rare diseases and reported Recall@1 (R@1; the proportion of cases with the correct diagnosis ranked first). We pooled R@1 using the Freeman-Tukey double arcsine transformation with DerSimonian-Laird random-effects models. Pre-specified subgroup analyses examined LLM knowledge augmentation strategy and input modality. Because both subgroup analyses retained high residual heterogeneity, we conducted a post-hoc exploratory analysis of evaluation benchmark disease composition, mapping diseases from major benchmarks to Orphanet prevalence classifications. Risk of bias was assessed using a modified QUADAS-3 instrument.

Findings: We identified 902 records, of which 564 were screened and 15 studies were eligible. These 15 studies contributed 19 system-dataset entries to the meta-analysis (total N=39,529 cases). The pooled R@1 was 43.3% (95% CI 35.1-51.6; I²=99.6%). Augmented LLM systems (agent-based reasoning, retrieval, or fine-tuning; k=8) achieved an R@1 of 52.5% (42.0-62.9) versus 35.4% (30.6-40.4) for standalone LLMs (k=11; p=0.004). Post-hoc exploratory analysis indicated that benchmark disease composition was associated with diagnostic performance: R@1 was 21.7% (18.2-25.5) on the Phenopacket Store dataset (k=2), in which 52.8% of diseases were ultra-rare, versus 52.0% (40.7-63.2) on RareBench (k=6), in which 29.3% were ultra-rare (p<0.001). All 19 system-dataset entries were assessed as being at high risk of bias, most commonly due to potential data leakage and limited reproducibility. No study provided prospective clinical validation.

Interpretation: Diagnostic performance of LLM-based systems for rare diseases varied substantially across evaluation benchmarks, and post-hoc exploratory analysis indicated that this variation was associated with benchmark disease composition: performance was higher in benchmarks containing fewer ultra-rare diseases and in systems incorporating external knowledge at inference time. However, all included studies were at high risk of bias, and none reported prospective clinical validation. These findings highlight the need for prevalence-stratified evaluation benchmarks and for independent prospective studies before clinical deployment.

Funding: This work was supported in part by the National Institutes of Health Common Fund, grant 15-HG-0130 from the National Human Genome Research Institute, U01NS134349 from the National Institute of Neurological Disorders and Stroke, R00LM014429 from the National Library of Medicine, and the Potocsnak Center for Undiagnosed and Rare Disorders.
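The pooling method described above (Freeman-Tukey double arcsine transformation with DerSimonian-Laird random-effects weighting) can be sketched in a few lines of Python. The function names and the example counts are illustrative, not data from the included studies, and the simple sin² back-transform is an approximation (an exact back-transformation would use the harmonic mean of sample sizes):

```python
import math

def ft_double_arcsine(x, n):
    """Freeman-Tukey double arcsine transform of a proportion x/n."""
    return math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))

def pool_dl(events, totals):
    """DerSimonian-Laird random-effects pooling of FT-transformed proportions.

    Returns (pooled proportion, tau^2). Back-transform uses the simple
    sin^2(t/2) approximation rather than the exact Miller correction.
    """
    k = len(events)
    t = [ft_double_arcsine(x, n) for x, n in zip(events, totals)]
    v = [1.0 / (n + 0.5) for n in totals]            # FT variance approximation
    w = [1.0 / vi for vi in v]                        # fixed-effect weights
    t_fixed = sum(wi * ti for wi, ti in zip(w, t)) / sum(w)
    q = sum(wi * (ti - t_fixed) ** 2 for wi, ti in zip(w, t))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # DL between-study variance
    w_star = [1.0 / (vi + tau2) for vi in v]          # random-effects weights
    t_pooled = sum(wi * ti for wi, ti in zip(w_star, t)) / sum(w_star)
    return math.sin(t_pooled / 2) ** 2, tau2

# Illustrative usage with made-up counts (correct-at-rank-1 events / total cases):
pooled, tau2 = pool_dl([10, 40], [100, 100])
```

In practice one would use an established implementation (e.g. the `metafor` package in R) rather than hand-rolled code, but the sketch shows where the transformation, the heterogeneity estimate τ², and the random-effects weights enter the calculation.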

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Genetics in Medicine | 69 | Top 0.1% | 35.2%
2 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 27.6%
(50% of probability mass above)
3 | npj Digital Medicine | 97 | Top 2% | 2.2%
4 | Scientific Reports | 3102 | Top 52% | 2.0%
5 | Genome Medicine | 154 | Top 4% | 1.8%
6 | JAMA Network Open | 127 | Top 2% | 1.8%
7 | Journal of Clinical Epidemiology | 28 | Top 0.2% | 1.8%
8 | Journal of Biomedical Informatics | 45 | Top 1.0% | 1.3%
9 | PLOS ONE | 4510 | Top 61% | 1.2%
10 | The American Journal of Human Genetics | 206 | Top 3% | 1.2%
11 | Med | 38 | Top 0.5% | 1.0%
12 | The Lancet Digital Health | 25 | Top 0.7% | 1.0%
13 | Nature Medicine | 117 | Top 4% | 1.0%
14 | eBioMedicine | 130 | Top 3% | 0.8%
15 | iScience | 1063 | Top 29% | 0.8%
16 | PLOS Digital Health | 91 | Top 2% | 0.8%
17 | GENETICS | 189 | Top 1% | 0.8%
18 | European Journal of Human Genetics | 49 | Top 1% | 0.8%
19 | Clinical Pharmacology & Therapeutics | 25 | Top 0.7% | 0.8%
20 | International Journal of Medical Informatics | 25 | Top 2% | 0.8%
21 | Healthcare | 16 | Top 2% | 0.8%
22 | PLOS Biology | 408 | Top 22% | 0.7%
23 | Nature Human Behaviour | 85 | Top 5% | 0.7%
24 | Archives of Disease in Childhood | 15 | Top 0.5% | 0.5%
25 | BMC Medicine | 163 | Top 8% | 0.5%
26 | BMC Medical Research Methodology | 43 | Top 2% | 0.5%
27 | Human Mutation | 29 | Top 0.9% | 0.5%
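As a check on the statement that the top 2 journals account for 50% of the predicted probability mass, the cutoff can be computed from the ranked probabilities. The first ten entries are transcribed below, and the helper function name is illustrative:

```python
# Predicted probabilities (%) for the top-ranked journals, transcribed from
# the first ten entries of the list above (illustrative check only).
probs = [35.2, 27.6, 2.2, 2.0, 1.8, 1.8, 1.8, 1.3, 1.2, 1.2]

def journals_for_mass(probs, threshold=50.0):
    """Return the size of the smallest prefix of the ranked list whose
    probabilities sum to at least `threshold` (%), or None if never reached."""
    total = 0.0
    for i, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return i
    return None

# The top 2 journals (35.2% + 27.6% = 62.8%) already exceed the 50% mark.
```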