Algorithmic Versus Expert Rankings of Large Language Models in Peritoneal Dialysis Prescription Review: A Trap-Embedded Synthetic Benchmark
Wei, C.-H.; Lin, H.-J.; Lai, W.-W.; Lin, H. M.
Show abstract
Background: Clinical LLM benchmarks rarely test whether algorithmic rankings agree with expert clinical judgment. We developed a trap-embedded peritoneal dialysis (PD) benchmark comparing multiple scoring constructs with blinded nephrologist ratings. Methods: We generated 125 synthetic PD cases containing 13 ISPD-aligned trap types. Five LLMs (Claude Sonnet 4.5, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4.1 Fast) evaluated each case three times at temperature 0 (1,875 calls). Primary outcome was must-identify TDR_must, analyzed with GEE and case-clustered bootstrap. Secondary analyses included a verbosity-sensitive alarm-burden proxy, WCS, relaxed-match scoring, WCS sensitivity analyses, and a 25-output blinded expert adequacy substudy. Must-identify kappa was 0.89 in Stage 1 and 0.92 in Stage 2. Results: Rankings were discordant. Recall ranked Claude (0.977) and GPT-5.4 (0.955) above the other models (0.86-0.90, p<0.0001). The alarm-burden proxy favored concise models (Grok 0.689; 21.6 vs 2.4 issues/case), while WCS produced a third ordering. In the expert substudy, inter-rater concordance was strong (rho 0.977), but WCS did not show a positive association with expert adequacy (rho -0.17, p=0.41). Conclusion: Clinical LLM rankings in PD prescription review depend strongly on scoring construct. Algorithmic metrics should be reported alongside blinded expert adequacy ratings and should not alone determine deployment.
Matching journals
The top 2 journals account for 50% of the predicted probability mass.