
Diagnostic Accuracy of Large Language Models for Rare Diseases: A Systematic Review and Meta-Analysis

Nguyen, M.-H.; Yang, C.-T.; Cassini, T. A.; Ma, F.; Hamid, R.; Bastarache, L.; Peterson, J. F.; Xu, H.; Li, L.; Ma, S.; Shyr, C.

2026-03-27 · genetic and genomic medicine
doi: 10.64898/2026.03.26.26349194 · medRxiv

Background: Large language models (LLMs) have been evaluated as tools to assist rare disease diagnosis, yet evidence on their accuracy remains fragmented. We conducted a systematic review and meta-analysis to synthesize the available evidence on the diagnostic performance of LLMs, identify sources of heterogeneity, and evaluate the current evidence base for clinical translation.

Methods: We searched PubMed, Embase, Web of Science, Cochrane Library, arXiv, and medRxiv (January 2020 to February 2026). Full-text articles and preprints were considered for inclusion. Eligible studies applied LLM-based systems to generate differential diagnoses for rare diseases and reported Recall@1 (R@1; the proportion of cases with the correct diagnosis ranked first). We pooled R@1 using the Freeman-Tukey double arcsine transformation with DerSimonian-Laird random-effects models. Pre-specified subgroup analyses examined LLM knowledge augmentation strategy and input modality. Because both subgroup analyses retained high residual heterogeneity, we conducted a post-hoc exploratory analysis of evaluation benchmark disease composition, mapping diseases from major benchmarks to Orphanet prevalence classifications. Risk of bias was assessed using a modified QUADAS-3 instrument.

Findings: We identified 902 records, of which 564 were screened and 15 studies were eligible. These 15 studies contributed 19 system-dataset entries to the meta-analysis (total N=39,529 cases). The pooled R@1 was 43.3% (95% CI 35.1-51.6; I²=99.6%). Augmented LLM systems (agent-based reasoning, retrieval, or fine-tuning; k=8) achieved an R@1 of 52.5% (42.0-62.9) versus 35.4% (30.6-40.4) for standalone LLMs (k=11; p=0.004). Post-hoc exploratory analysis indicated that benchmark disease composition was associated with diagnostic performance: R@1 was 21.7% (18.2-25.5) on the Phenopacket Store dataset (k=2), in which 52.8% of diseases were ultra-rare, versus 52.0% (40.7-63.2) on RareBench (k=6), in which 29.3% were ultra-rare (p<0.001). All 19 system-dataset entries were assessed as being at high risk of bias, most commonly due to potential data leakage and limited reproducibility. No study provided prospective clinical validation.

Interpretation: Diagnostic performance of LLM-based systems for rare diseases varied substantially across evaluation benchmarks, and post-hoc exploratory analysis indicated that this variation was associated with benchmark disease composition: performance was higher in benchmarks containing fewer ultra-rare diseases and in systems incorporating external knowledge at inference time. However, all included studies were at high risk of bias, and none reported prospective clinical validation. These findings highlight the need for prevalence-stratified evaluation benchmarks and for independent prospective studies before clinical deployment.

Funding: This work was supported in part by the National Institutes of Health Common Fund, grant 15-HG-0130 from the National Human Genome Research Institute, U01NS134349 from the National Institute of Neurological Disorders and Stroke, R00LM014429 from the National Library of Medicine, and the Potocsnak Center for Undiagnosed and Rare Disorders.
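The pooling method described above (Freeman-Tukey double arcsine transformation with DerSimonian-Laird random-effects weighting) can be sketched in a few lines of Python. The function names and the example counts are illustrative, not data from the included studies, and the simple sin² back-transform is an approximation (an exact back-transformation would use the harmonic mean of sample sizes):

```python
import math

def ft_double_arcsine(x, n):
    """Freeman-Tukey double arcsine transform of a proportion x/n."""
    return math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))

def pool_dl(events, totals):
    """DerSimonian-Laird random-effects pooling of FT-transformed proportions.

    Returns (pooled proportion, tau^2). Back-transform uses the simple
    sin^2(t/2) approximation rather than the exact Miller correction.
    """
    k = len(events)
    t = [ft_double_arcsine(x, n) for x, n in zip(events, totals)]
    v = [1.0 / (n + 0.5) for n in totals]            # FT variance approximation
    w = [1.0 / vi for vi in v]                        # fixed-effect weights
    t_fixed = sum(wi * ti for wi, ti in zip(w, t)) / sum(w)
    q = sum(wi * (ti - t_fixed) ** 2 for wi, ti in zip(w, t))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # DL between-study variance
    w_star = [1.0 / (vi + tau2) for vi in v]          # random-effects weights
    t_pooled = sum(wi * ti for wi, ti in zip(w_star, t)) / sum(w_star)
    return math.sin(t_pooled / 2) ** 2, tau2

# Illustrative usage with made-up counts (correct-at-rank-1 events / total cases):
pooled, tau2 = pool_dl([10, 40], [100, 100])
```

In practice one would use an established implementation (e.g. the `metafor` package in R) rather than hand-rolled code, but the sketch shows where the transformation, the heterogeneity estimate τ², and the random-effects weights enter the calculation.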

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Genetics in Medicine | 69 | Top 0.1% | 35.2%
2 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 27.6%
(50% of probability mass above)
3 | npj Digital Medicine | 97 | Top 2% | 2.2%
4 | Scientific Reports | 3102 | Top 52% | 2.0%
5 | Genome Medicine | 154 | Top 4% | 1.8%
6 | JAMA Network Open | 127 | Top 2% | 1.8%
7 | Journal of Clinical Epidemiology | 28 | Top 0.2% | 1.8%
8 | Journal of Biomedical Informatics | 45 | Top 1.0% | 1.3%
9 | PLOS ONE | 4510 | Top 61% | 1.2%
10 | The American Journal of Human Genetics | 206 | Top 3% | 1.2%
11 | Med | 38 | Top 0.5% | 1.0%
12 | The Lancet Digital Health | 25 | Top 0.7% | 1.0%
13 | Nature Medicine | 117 | Top 4% | 1.0%
14 | eBioMedicine | 130 | Top 3% | 0.8%
15 | iScience | 1063 | Top 29% | 0.8%
16 | PLOS Digital Health | 91 | Top 2% | 0.8%
17 | GENETICS | 189 | Top 1% | 0.8%
18 | European Journal of Human Genetics | 49 | Top 1% | 0.8%
19 | Clinical Pharmacology & Therapeutics | 25 | Top 0.7% | 0.8%
20 | International Journal of Medical Informatics | 25 | Top 2% | 0.8%
21 | Healthcare | 16 | Top 2% | 0.8%
22 | PLOS Biology | 408 | Top 22% | 0.7%
23 | Nature Human Behaviour | 85 | Top 5% | 0.7%
24 | Archives of Disease in Childhood | 15 | Top 0.5% | 0.5%
25 | BMC Medicine | 163 | Top 8% | 0.5%
26 | BMC Medical Research Methodology | 43 | Top 2% | 0.5%
27 | Human Mutation | 29 | Top 0.9% | 0.5%
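As a check on the statement that the top 2 journals account for 50% of the predicted probability mass, the cutoff can be computed from the ranked probabilities. The first ten entries are transcribed below, and the helper function name is illustrative:

```python
# Predicted probabilities (%) for the top-ranked journals, transcribed from
# the first ten entries of the list above (illustrative check only).
probs = [35.2, 27.6, 2.2, 2.0, 1.8, 1.8, 1.8, 1.3, 1.2, 1.2]

def journals_for_mass(probs, threshold=50.0):
    """Return the size of the smallest prefix of the ranked list whose
    probabilities sum to at least `threshold` (%), or None if never reached."""
    total = 0.0
    for i, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return i
    return None

# The top 2 journals (35.2% + 27.6% = 62.8%) already exceed the 50% mark.
```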