
Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark

Chuang, K.-C.; Lin, H.-J.; Lin, H.-M.

2026-05-26 health informatics
10.64898/2026.05.23.26353939 medRxiv

Background: Patients with CKD and polypharmacy face high rates of drug-related problems, yet comprehensive medication review remains time-intensive and inconsistently performed. Large language models (LLMs) may augment this process, but existing benchmarks use multiple-choice formats that do not reflect open-ended, nephrology-specific review. We developed a trap-embedded synthetic CKD benchmark and evaluated five current-generation LLMs (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.1 Fast, DeepSeek R1; tested April-May 2026) for open-ended medication review.

Methods: Fifty synthetic CKD cases across three complexity groups (G3a-G3b [n=20], G4 [n=15], G5/G5D/transplant [n=15]) with 8-12 medications and ≥2 embedded clinical traps each were scored against nephrologist-adjudicated gold standards. Each model produced three independent responses per case (temperature 0; 750 total outputs). Primary endpoint was per-case macro F1; secondary endpoints were safety-critical omission rate, PI-adjudicated hallucination rate, and intra-model consistency. Blinded inter-rater reliability for gold-standard item detection was assessed on a 30% sample.

Results: Consensus-level macro F1 ranged from 0.41 (Claude Sonnet 4.6) to 0.49 (Grok 4.1 Fast) (Friedman P < 0.001). Phosphate binder timing (11%) and hyperkalemia combinations (33%) were poorly detected across all models. Safety-critical omission rate ranged from 22% to 48% (P < 0.001); PI-adjudicated hallucination ranged from 0% (GPT-5.4) to 54% (DeepSeek R1), including fabricated dose caps and non-existent guideline citations. Blinded reliability for gold-standard item detection was high (kappa = 0.934, n = 92).

Conclusions: This nephrology-specific benchmark exposes clinically important LLM blind spots that generic multiple-choice evaluations would not detect. Heterogeneous hallucination and omission rates indicate that model selection and domain-specific guardrails should precede any clinical deployment of LLM-assisted CKD medication review. Prospective validation with real patient data and human comparators is required before deployment recommendations can be made.
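
The primary endpoint, per-case macro F1, can be illustrated with a minimal sketch. The abstract does not spell out the scoring procedure, so this assumes that each case has a set of gold-standard items, that each response yields a set of detected items, that F1 is computed per case, and that the macro value is the unweighted mean over cases. The function names and the toy item labels are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def case_f1(gold: set[str], detected: set[str]) -> float:
    """F1 for one case: harmonic mean of precision and recall over item sets."""
    true_pos = len(gold & detected)
    if true_pos == 0:
        # No correct detections: treat F1 as 0 (also avoids division by zero).
        return 0.0
    precision = true_pos / len(detected)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(cases: list[tuple[set[str], set[str]]]) -> float:
    """Unweighted mean of per-case F1 (assumed meaning of 'per-case macro F1')."""
    return mean(case_f1(gold, detected) for gold, detected in cases)


# Toy data, invented for illustration only: (gold items, detected items) per case.
toy_cases = [
    ({"binder_timing", "hyperkalemia_combo"}, {"hyperkalemia_combo"}),
    ({"dose_adjustment", "nsaid_avoid"}, {"dose_adjustment", "nsaid_avoid", "extra_flag"}),
]

print(round(macro_f1(toy_cases), 3))  # prints 0.733
```

In the toy data, the first case misses one gold item (recall 0.5, F1 about 0.667) and the second adds one unsupported item (precision about 0.667, F1 0.8). Under this assumed definition, omissions and unsupported additions both lower the score.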

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 17.6% |
| 2 | JAMIA Open | 37 | Top 0.1% | 8.4% |
| 3 | PLOS ONE | 4510 | Top 22% | 8.2% |
| 4 | Journal of the American Society of Nephrology | 52 | Top 0.2% | 6.4% |
| 5 | npj Digital Medicine | 97 | Top 1.0% | 4.9% |
| 6 | Kidney360 | 22 | Top 0.2% | 4.3% |
| 7 | BMJ Open | 554 | Top 6% | 3.6% |
| | 50% of probability mass above | | | |
| 8 | BMJ Health & Care Informatics | 13 | Top 0.3% | 2.1% |
| 9 | Clinical and Translational Science | 21 | Top 0.4% | 1.9% |
| 10 | BMC Nephrology | 13 | Top 0.2% | 1.9% |
| 11 | Kidney International Reports | 14 | Top 0.2% | 1.9% |
| 12 | JMIR Public Health and Surveillance | 45 | Top 1% | 1.8% |
| 13 | Annals of Internal Medicine | 27 | Top 0.4% | 1.8% |
| 14 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.7% |
| 15 | Scientific Reports | 3102 | Top 61% | 1.5% |
| 16 | PLOS Digital Health | 91 | Top 2% | 1.3% |
| 17 | BMC Medicine | 163 | Top 5% | 1.2% |
| 18 | JMIR Medical Informatics | 17 | Top 1% | 1.0% |
| 19 | The Lancet Digital Health | 25 | Top 0.8% | 1.0% |
| 20 | BMC Medical Research Methodology | 43 | Top 1% | 0.9% |
| 21 | Journal of Personalized Medicine | 28 | Top 0.9% | 0.9% |
| 22 | British Journal of General Practice | 22 | Top 0.5% | 0.8% |
| 23 | BMC Infectious Diseases | 118 | Top 5% | 0.8% |
| 24 | European Respiratory Journal | 54 | Top 2% | 0.7% |
| 25 | Wellcome Open Research | 57 | Top 2% | 0.7% |
| 26 | Diabetologia | 36 | Top 0.9% | 0.7% |
| 27 | British Journal of Clinical Pharmacology | 21 | Top 0.7% | 0.7% |
| 28 | Frontiers in Digital Health | 20 | Top 2% | 0.6% |
| 29 | Pilot and Feasibility Studies | 12 | Top 0.7% | 0.6% |
| 30 | Pharmacoepidemiology and Drug Safety | 13 | Top 0.5% | 0.6% |
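
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The sketch below sums the first ten values in rank order and reports how many entries are needed to reach the threshold. The helper name `journals_to_reach` is hypothetical, and treating the cut-off as "cumulative sum ≥ 50%" is an assumption about how the page defines it.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the list above.
probs = [17.6, 8.4, 8.2, 6.4, 4.9, 4.3, 3.6, 2.1, 1.9, 1.9]


def journals_to_reach(values, threshold=50.0):
    """Smallest number of top-ranked entries whose running total meets the threshold."""
    for k, total in enumerate(accumulate(values), start=1):
        if total >= threshold:
            return k
    return None  # threshold never reached with the values given


print(journals_to_reach(probs))  # prints 7
```

The running total is about 49.8% after six journals and 53.4% after seven. Because the displayed percentages are rounded, the six-journal total sits very close to the threshold.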