Boundary-Specific Failure Modes and Safety Trade-offs of Large Language Models in Chronic Kidney Disease Renoprotective Therapy Review: A Stratified Synthetic Benchmark

Yeh, S.-E.; Lin, H.-J.; Lai, W.-W.; Lin, H.

2026-05-30 nephrology
10.64898/2026.05.28.26353938 medRxiv
Background. Renoprotective therapies - SGLT2 inhibitors, finerenone, and renin-angiotensin system inhibitors (RASi) - remain underutilised in chronic kidney disease (CKD). Large language models (LLMs) may detect therapy omissions, but their performance across CKD severity strata and at clinical decision boundaries has not been evaluated.

Methods. We constructed 100 synthetic CKD vignettes (G3a-G5D; 75 with prespecified omissions, 25 decoys) and queried four LLMs three times each at temperature 0 (1,200 calls). Omission criteria were adapted from KDIGO 2024, including an investigator-defined gray-zone RASi initiation criterion at eGFR<15. Two nephrologists independently classified a stratified 20-case subset.

Results. For SGLT2 inhibitor and finerenone omissions, all models achieved near-ceiling sensitivity (97-100%). For RASi, performance diverged at the eGFR<15 boundary: Grok 4.1 Fast 85% versus GPT-5.4 55%, Gemini 10%, DeepSeek 10%. Gap-detection inter-rater agreement was perfect (kappa = 1.000). Clinically incorrect reasoning rates ranged from 0% (GPT-5.4) to 27% (DeepSeek R1); of 52 instances, 31 were factual pharmacology errors and 21 reflected conservative boundary-discordant reasoning. Reproducibility (Jaccard) ranged from 0.74 to 0.93.

Conclusions. This boundary-aware synthetic benchmark showed that aggregate sensitivity can conceal clinically important operational-rule discordance. Rule-based SGLT2 inhibitor and finerenone omissions were detected with near-ceiling sensitivity, whereas an investigator-defined gray-zone RASi criterion at eGFR<15 exposed model-specific boundary behaviour. Evaluation of LLM-based CKD decision support should report boundary-specific performance, reproducibility, and clinically incorrect reasoning alongside aggregate metrics.
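
The abstract reports reproducibility as a Jaccard index over repeated queries but does not say how it was computed. The sketch below shows one plausible reading: the mean pairwise Jaccard similarity between the sets of omissions flagged in each of the three runs for a single vignette. The function names, the pairwise-averaging choice, the empty-set convention, and the example labels are illustrative assumptions, not the authors' method.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of labels.

    Assumption: two empty sets (no omission flagged in either run)
    are treated as perfectly reproducible and score 1.0.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical output for one vignette: omissions flagged in three repeated runs.
runs = [
    {"SGLT2i", "finerenone"},
    {"SGLT2i", "finerenone"},
    {"SGLT2i"},
]

score = mean_pairwise_jaccard(runs)
print(f"mean pairwise Jaccard: {score:.3f}")
```

Under this reading, a per-model figure would come from averaging the per-vignette score across all 100 vignettes; other aggregation choices (for example intersection over union across all three runs at once) would give different values.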

Matching journals

The top 3 journals account for 50% of the predicted probability mass.