Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients

Shibakov, D.

2026-02-17 · Health informatics
medRxiv preprint · DOI: 10.64898/2026.02.13.26346284
Abstract

Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers.

Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular disease risk, chronic kidney disease risk, systemic inflammation, nutrient deficiency, liver risk, and anemia) from laboratory biomarkers. We evaluated five LLMs from four providers (Grok-3 from xAI, GPT-4o and GPT-4o-mini from OpenAI, Claude Haiku 4.5 from Anthropic, and Gemini 2.0 Flash from Google) using identical system prompts and inputs on 4,018 adults from the CDC NHANES 2017-2018 cycle. Ground truth was established using published clinical criteria (ADA, AHA, KDIGO, WHO). Performance was measured by F1 score with 95% confidence intervals, sensitivity, specificity, and positive predictive value.

Results: All five models achieved clinical-grade performance (F1 > 0.86) on the eight evaluable patterns. Mean F1 scores ranged from 0.865 (95% CI: 0.799-0.931) for GPT-4o-mini to 0.963 (95% CI: 0.930-0.996) for Grok-3. Flagship models significantly outperformed economy-tier models (mean F1: 0.940 vs 0.881; paired t-test p=0.004). Grok-3 achieved near-perfect scores on liver risk (F1=1.000), anemia (0.999), and nutrient deficiency (0.997). Cardiovascular disease risk was the most challenging pattern (F1 range: 0.853-0.885). JSON parse rates exceeded 99.9% for all models. Total benchmark cost was approximately $59 USD.

Conclusions: A standardized prompt-based framework achieves clinical-grade accuracy across five LLMs from four independent providers, demonstrating model-agnostic generalizability. These findings support the feasibility of vendor-independent clinical AI systems that can leverage multiple models without requiring framework revalidation.
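The abstract reports per-pattern F1 scores with 95% confidence intervals. As an illustration only, the sketch below computes an F1 score from binary labels and attaches a percentile bootstrap 95% CI, one common way to obtain such intervals; the paper's exact CI method is not stated here, so treat the bootstrap choice as an assumption.

```python
# Illustrative sketch: F1 score with a percentile-bootstrap 95% CI.
# The bootstrap approach is an assumption; the paper's exact CI method
# is not specified in this listing.
import random

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = pattern present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample patients with replacement,
    recompute F1 each time, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For example, with 4,018 patients per pattern (as in this benchmark), each bootstrap replicate resamples all 4,018 patients with replacement before recomputing F1.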

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Probability
1 | npj Digital Medicine | 97 | Top 0.1% | 41.4%
2 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 19.4%
(50% of probability mass above this line)
3 | JAMIA Open | 37 | Top 0.5% | 2.7%
4 | JAMA Network Open | 127 | Top 1% | 2.2%
5 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 2.2%
6 | PLOS Digital Health | 91 | Top 1% | 2.0%
7 | Scientific Reports | 3102 | Top 52% | 2.0%
8 | Frontiers in Digital Health | 20 | Top 0.5% | 2.0%
9 | Journal of Biomedical Informatics | 45 | Top 0.7% | 2.0%
10 | BMJ Health & Care Informatics | 13 | Top 0.3% | 2.0%
11 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8%
12 | Journal of Medical Internet Research | 85 | Top 2% | 1.8%
13 | JMIR Medical Informatics | 17 | Top 0.8% | 1.5%
14 | JMIR Public Health and Surveillance | 45 | Top 2% | 1.5%
15 | JCO Clinical Cancer Informatics | 18 | Top 0.5% | 1.5%
16 | PLOS ONE | 4510 | Top 63% | 0.9%
17 | Computers in Biology and Medicine | 120 | Top 4% | 0.9%
18 | BMC Medical Research Methodology | 43 | Top 1% | 0.8%
19 | The Lancet Digital Health | 25 | Top 0.9% | 0.8%
20 | Healthcare | 16 | Top 2% | 0.8%
21 | Annals of Internal Medicine | 27 | Top 0.9% | 0.8%
22 | Nature Medicine | 117 | Top 4% | 0.8%
23 | iScience | 1063 | Top 30% | 0.8%
24 | Heliyon | 146 | Top 6% | 0.8%
25 | Frontiers in Public Health | 140 | Top 8% | 0.8%
26 | Cureus | 67 | Top 5% | 0.7%
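The "top 2 journals account for 50% of the predicted probability mass" claim can be checked directly from the ranked probabilities above: a minimal sketch, using the listed values.

```python
# Minimal sketch: how many top-ranked journals are needed to reach a
# target share of the predicted probability mass. Probabilities are the
# leading entries from the journal-match list above.
probs = [0.414, 0.194, 0.027, 0.022, 0.022, 0.020, 0.020, 0.020]

def journals_to_cover(probs, target=0.5):
    """Return the smallest k such that the top-k probabilities sum to
    at least `target` (probs must be sorted in descending order)."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return k
    return len(probs)
```

With these values, the first journal alone covers 41.4%, and adding the second brings the cumulative mass to 60.8%, so two journals suffice to exceed the 50% threshold.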