Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients
Shibakov, D.
Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers.

Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular disease risk, chronic kidney disease risk, systemic inflammation, nutrient deficiency, liver risk, and anemia) from laboratory biomarkers. We evaluated five LLMs from four providers, Grok-3 (xAI), GPT-4o and GPT-4o-mini (OpenAI), Claude Haiku 4.5 (Anthropic), and Gemini 2.0 Flash (Google), using identical system prompts and inputs on 4,018 adults from the CDC NHANES 2017-2018. Ground truth was established using published clinical criteria (ADA, AHA, KDIGO, WHO). Performance was measured by F1 score with 95% confidence intervals, sensitivity, specificity, and positive predictive value.

Results: All five models achieved clinical-grade performance (F1 > 0.86) on eight evaluable patterns. Mean F1 scores ranged from 0.865 (95% CI: 0.799-0.931) for GPT-4o-mini to 0.963 (95% CI: 0.930-0.996) for Grok-3. Flagship models significantly outperformed economy-tier models (mean F1: 0.940 vs 0.881; paired t-test p=0.004). Grok-3 achieved near-perfect scores on liver risk (F1=1.000), anemia (0.999), and nutrient deficiency (0.997). Cardiovascular disease risk was the most challenging pattern (F1 range: 0.853-0.885). JSON parse rates exceeded 99.9% for all models. Total benchmark cost was approximately $59 USD.

Conclusions: A standardized prompt-based framework achieves clinical-grade accuracy across five LLMs from four independent providers, demonstrating model-agnostic generalizability.
These findings support the feasibility of vendor-independent clinical AI systems that can leverage multiple models without requiring framework revalidation.
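The per-pattern metrics reported above (sensitivity, specificity, PPV, F1, and 95% confidence intervals) can be illustrated with a minimal sketch. The function names, the percentile-bootstrap CI, and the binary label encoding are assumptions for illustration, not the authors' published evaluation code.

```python
import random

def pattern_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and F1 for one clinical pattern,
    given binary ground-truth and model-predicted labels per patient."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv, "f1": f1}

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for F1: resample patients with
    replacement and take the 2.5th/97.5th percentiles of the scores."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(pattern_metrics([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])["f1"])
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]
```

In a benchmark like this, each of the eight patterns would be scored separately per model, yielding the per-pattern F1 values and CIs quoted in the Results.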
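The flagship-vs-economy comparison (paired t-test, p=0.004) pairs the two tiers' scores on the same eight patterns. A minimal sketch of the paired t statistic, assuming per-pattern mean F1 scores as the matched samples (the variable names are illustrative):

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic: mean of the per-pattern differences divided by
    its standard error. With k patterns, the p-value would come from a
    t distribution with k - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
```

Because each pattern is scored by both tiers, pairing removes between-pattern difficulty variation (e.g. cardiovascular risk being hardest for every model) from the comparison.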