
Disease Risk Prediction Using Structured EHR Data: Can Generalist Large Language Models Match Specialized Clinical Foundation Models? A Comparative Evaluation with Fine-Tuning

Mao, B.; Prasadha, M. K.; Xie, Z.; He, J.; Ghebranious, M.; Xu, H.; Zhi, D.; Rasmy, L.

2026-05-01 health informatics
10.64898/2026.04.24.26351503 medRxiv

Background: Electronic health records (EHRs) with clinical decision support tools are now ubiquitous in healthcare organizations. Clinical foundation models (CFMs) pretrained on large-scale, heterogeneous structured EHR data have emerged as a powerful approach to improve predictive performance and generalizability. Meanwhile, large language models (LLMs) pretrained on broad data sources are being applied to an expanding range of healthcare tasks. However, it remains unclear whether generalist LLMs can match specialized CFMs for disease risk prediction using structured clinical data.
Methods: We compared CFMs (Med-BERT, CLMBR) against fine-tuned generalist LLMs (Mistral, LLaMA-2/3/3.1), a clinical LLM (Me-LLaMA), and LLM-generated embeddings paired with simple classifiers (using DeepSeek, Qwen3, and GPT-OSS) on two disease risk prediction tasks: heart failure risk among diabetic patients (DHF) and pancreatic cancer diagnosis (PaCa). Evaluations spanned multi-site EHR data, claims data, and an open-source single-institution benchmark (EHRSHOT). Performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
Results: On larger EHR and claims cohorts (>30,000 patients), fine-tuned CFMs outperformed fine-tuned LLMs by a small but statistically significant margin (<1% AUROC). The clinical LLM performed comparably to generalist LLMs despite being smaller. On the open-source PaCa cohort (3,810 patients, 199 cases), LLMs achieved slightly higher AUROCs that were not statistically significant (LLaMA-3.1-70B 86.1% vs. Med-BERT 85.3%, p=0.27), but CFMs achieved significantly higher AUPRC (Med-BERT 55.9% vs. LLaMA-3.1-70B 41.1%, p=0.001). Notably, LLM-generated trajectory embeddings paired with logistic regression or a simple MLP, without any LLM fine-tuning, achieved the best overall performance, with AUROC exceeding 90% (Qwen3) and AUPRC reaching 66% (GPT-OSS 20B).
Conclusion: LLM-generated embeddings with lightweight classifiers outperformed both fine-tuned CFMs and fine-tuned LLMs on AUROC and AUPRC. While these results demonstrate the potential of generalist models to match or surpass specialized CFMs, their substantially greater computational cost and variable AUPRC performance in the fine-tuning setting warrant caution. We provide a reproducible evaluation framework and codebase to support continued benchmarking.
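
The abstract reports both AUROC and AUPRC, and the two can diverge on a low-prevalence cohort such as PaCa (199 cases among 3,810 patients, about 5.2%). The sketch below is not the authors' evaluation code; it is a minimal standard-library illustration of how the two metrics are commonly computed, assuming binary labels and untied scores. The function names and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AUPRC estimate: mean precision at the rank of each positive.

    Assumes scores are not tied; tied scores would need threshold grouping.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


# Toy data (hypothetical): 3 positives among 8 patients.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1]
toy_scores = [0.10, 0.20, 0.90, 0.30, 0.40, 0.80, 0.05, 0.70]

toy_auroc = auroc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)
print(f"AUROC={toy_auroc:.3f}  AUPRC={toy_auprc:.3f}")
```

AUROC depends only on how positives rank relative to negatives, whereas precision is diluted by every negative ranked above a positive. With few cases, a handful of high-scoring negatives can lower AUPRC considerably while leaving AUROC nearly unchanged, which is one plausible reading of the PaCa result above.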

Matching journals

The top 2 journals together account for more than 50% of the predicted probability mass (49.3% + 14.1% = 63.4%).
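
As a small illustration of how such a cutoff can be located, the sketch below counts how many top-ranked entries are needed before the running total reaches a threshold. The helper name `journals_to_reach` is hypothetical, and the input uses only the three highest probabilities listed in this section.

```python
from itertools import accumulate
from typing import Sequence


def journals_to_reach(probabilities: Sequence[float], threshold: float) -> int:
    """Return how many leading entries are needed for the running sum to reach threshold."""
    for count, running_total in enumerate(accumulate(probabilities), start=1):
        if running_total >= threshold:
            return count
    raise ValueError("threshold is never reached by the given probabilities")


# Probabilities (in percent) of the three top-ranked journals, in rank order.
top_probabilities = [49.3, 14.1, 6.3]
needed = journals_to_reach(top_probabilities, 50.0)
print(needed)
```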

1. npj Digital Medicine | 97 papers in training set | Top 0.1% | 49.3%
2. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.2% | 14.1%
(50% of probability mass above)
3. Journal of Biomedical Informatics | 45 papers in training set | Top 0.3% | 6.3%
4. The Lancet Digital Health | 25 papers in training set | Top 0.1% | 4.3%
5. Scientific Reports | 3102 papers in training set | Top 38% | 3.5%
6. JMIR Medical Informatics | 17 papers in training set | Top 0.6% | 2.0%
7. Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.7%
8. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 1.7%
9. International Journal of Medical Informatics | 25 papers in training set | Top 1.0% | 1.5%
10. Frontiers in Digital Health | 20 papers in training set | Top 1.0% | 1.2%
11. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.6% | 1.2%
12. JAMIA Open | 37 papers in training set | Top 1% | 0.9%
13. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.6% | 0.9%
14. PLOS Digital Health | 91 papers in training set | Top 3% | 0.7%
15. BMJ Health & Care Informatics | 13 papers in training set | Top 1% | 0.7%
16. eBioMedicine | 130 papers in training set | Top 6% | 0.6%
17. Patterns | 70 papers in training set | Top 3% | 0.6%
18. Bioinformatics | 1061 papers in training set | Top 10% | 0.6%
19. European Heart Journal - Digital Health | 15 papers in training set | Top 0.7% | 0.6%
20. JMIR Public Health and Surveillance | 45 papers in training set | Top 4% | 0.6%