A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

Proulx, J.; Daines, B.; Barton, M.; Leonard, M. E.; Garcia, J. A.; Young, B.; Snell, Q.; West, T. W.; Watson, S. R.; AlQaseer, M.; Louiset, M.; Maqsood, M. B.; Voutt-Goos, M. J.; Douma, C.; Kasbekar, N.; Jeffries, J.; Abu-Rahmeh, W.; Frush, K.; Grewal, D. K.; Bahsoun, M.; Leonard, M.; Frankel, A.; Classen, D. C.; Pestotnik, S. L.

2026-06-10 health informatics
10.64898/2026.06.05.26354271 medRxiv
Objective. To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, together with a non-overlapping three-tier evaluation framework that separates highest-stakes discrimination, the operational clinical decision support (CDS) regime, and category-correct alerting.

Materials and Methods. PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios without overlap: Discrimination (98 scenarios: 50 fatal vs 48 deception, near-balanced at 51%/49%); Operational (394 scenarios: 261 serious unsafe plus 133 safe, including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026.

Results. Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. The leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, making them indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper.

Discussion and Conclusion. PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
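To make the "naive always-alert classifier" comparison concrete, the sketch below computes the four headline metrics for a classifier that alerts on every scenario. It is an illustration only, not the authors' evaluation code. It assumes that the binary positives are the 50 fatal plus 261 serious unsafe scenarios (which matches the 311 alert-required scenarios quoted above) and that the remaining 181 scenarios are negatives; the `Confusion` class and all variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts; 'positive' means an alert is required."""
    tp: int  # unsafe scenario, classifier alerted
    fp: int  # safe scenario, classifier alerted
    tn: int  # safe scenario, classifier stayed silent
    fn: int  # unsafe scenario, classifier stayed silent

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    @property
    def f1(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Scenario counts quoted in the abstract.
FATAL, DECEPTION = 50, 48                   # Discrimination tier
SERIOUS_UNSAFE, OPERATIONAL_SAFE = 261, 133  # Operational tier
TOTAL_SCENARIOS, MODELS, RUNS = 492, 40, 3

# Assumption (not stated explicitly in the abstract): positives are the
# alert-required scenarios, negatives are everything else.
POSITIVES = FATAL + SERIOUS_UNSAFE        # 311
NEGATIVES = DECEPTION + OPERATIONAL_SAFE  # 181

# Consistency checks on the quoted figures.
assert POSITIVES + NEGATIVES == TOTAL_SCENARIOS
assert TOTAL_SCENARIOS * MODELS * RUNS == 59_040

# A classifier that alerts on every scenario never produces TN or FN.
always_alert = Confusion(tp=POSITIVES, fp=NEGATIVES, tn=0, fn=0)

print(f"sensitivity {always_alert.sensitivity:.1%}")
print(f"specificity {always_alert.specificity:.1%}")
print(f"accuracy    {always_alert.accuracy:.1%}")
print(f"F1          {always_alert.f1:.1%}")
```

Under these assumptions the always-alert baseline reaches 100% sensitivity, 0% specificity, about 63.2% accuracy, and about 77.5% F1, so the low end of the reported ranges (F1 78.5%, accuracy 65.4%) sits only slightly above it. This is one way to see why a single headline metric separates models poorly.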

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.1%, 32.9%
2. npj Digital Medicine: 97 papers in training set, Top 0.3%, 18.6%
(50% of probability mass above)
3. JAMIA Open: 37 papers in training set, Top 0.1%, 8.4%
4. Scientific Reports: 3102 papers in training set, Top 27%, 4.3%
5. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.8%, 3.6%
6. International Journal of Medical Informatics: 25 papers in training set, Top 0.4%, 3.6%
7. Journal of Biomedical Informatics: 45 papers in training set, Top 0.5%, 3.1%
8. Frontiers in Digital Health: 20 papers in training set, Top 0.5%, 2.1%
9. Computers in Biology and Medicine: 120 papers in training set, Top 2%, 1.9%
10. JMIR Medical Informatics: 17 papers in training set, Top 0.7%, 1.8%
11. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.5%, 1.7%
12. JMIR Public Health and Surveillance: 45 papers in training set, Top 2%, 1.5%
13. The Lancet Digital Health: 25 papers in training set, Top 0.6%, 1.3%
14. BMC Medical Research Methodology: 43 papers in training set, Top 0.9%, 1.2%
15. PLOS ONE: 4510 papers in training set, Top 61%, 1.1%
16. Journal of Medical Internet Research: 85 papers in training set, Top 4%, 0.9%
17. JAMA Network Open: 127 papers in training set, Top 4%, 0.8%
18. iScience: 1063 papers in training set, Top 34%, 0.7%
19. BMC Bioinformatics: 383 papers in training set, Top 7%, 0.7%
20. Inflammatory Bowel Diseases: 15 papers in training set, Top 0.3%, 0.6%
21. Cureus: 67 papers in training set, Top 6%, 0.6%
22. Frontiers in Artificial Intelligence: 18 papers in training set, Top 0.9%, 0.6%
23. Heliyon: 146 papers in training set, Top 8%, 0.6%