
Performance evaluation and benchmarking across 16 large language models on a comprehensive real-world emergency department triage data set

Benning, L.; Hirsch, A.; Groeschel, M.; Roeschl, T.; Spott, M.; Hans, F. P.; Urban, T.; Busch, H.-J.; Meyer, A.; Madrid, J.

2026-06-05 health informatics
10.64898/2026.05.28.26353935 medRxiv

Background: Emergency department (ED) triage is a high-stakes clinical decision process that determines patient prioritization and resource allocation under time pressure. Large language models (LLMs) have recently been proposed as decision-support tools for triage, yet most evaluations rely on simulated scenarios or curated datasets, and evidence from real-world clinical environments remains limited. The objective of this project was to systematically evaluate the performance, calibration, and reproducibility of multiple contemporary LLMs for Emergency Severity Index (ESI) classification and sectoral allocation (ED vs. urgent care practice, UCP) using a comprehensive real-world triage dataset.

Material and Methods: This retrospective cross-sectional benchmarking study was conducted at a tertiary academic ED in Germany with an integrated central point of assessment (CPA). The study included all consecutive adult walk-in encounters (>18 years) presenting between October 2023 and February 2024 (N = 16,107). Data were collected from a structured clinical decision support system capturing presenting complaints, vital signs, and triage decisions recorded by specialized nursing staff. The data comprised structured clinical variables routinely collected at triage, including presenting complaint categories (CEDIS-PCL), vital signs according to the ABCDE framework, and additional structured or free-text clinical information.

Results: The primary outcome was the agreement between LLM-predicted and nurse-assigned ESI levels, measured using quadratic-weighted Cohen's κ. Secondary outcomes included sectoral assignment agreement, misclassification patterns (over- and under-triage), calibration metrics, and output reproducibility. Quadratic-weighted κ values ranged from 0.18 to 0.75 across models. Only a structured stepwise prompting strategy achieved substantial agreement (κ_qw = 0.747), approaching reported human inter-rater reliability. Most models demonstrated moderate or lower agreement and systematic overconfidence, with expected calibration errors (ECE) based on verbalized confidence ranging from 0.099 to 0.355. Sectoral assignment agreement (ED vs. UCP) was uniformly low (κ < 0.30). Reproducibility testing revealed substantial variability in 23% of cases, indicating non-deterministic output behavior for clinically relevant decisions.

Conclusions: Current LLMs demonstrate heterogeneous and generally limited performance in real-world emergency triage tasks. Structured algorithm-guided prompting appears more influential than model architecture or size. Before clinical implementation, improvements in calibration, reliability, and workflow integration are required, alongside regulatory-compliant validation in prospective clinical settings.
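For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how quadratic-weighted Cohen's kappa and an expected calibration error could be computed. It is not the authors' code. The function names, the use of nurse-assigned ESI levels (1 to 5) as the reference, and the choice of 10 equal-width confidence bins are all assumptions made for illustration; the abstract does not specify the binning scheme.

```python
from typing import Sequence


def quadratic_weighted_kappa(reference: Sequence[int],
                             predicted: Sequence[int],
                             n_levels: int = 5) -> float:
    """Quadratic-weighted Cohen's kappa for ordinal levels 1..n_levels."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)

    # Confusion matrix: rows = reference level, columns = predicted level.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for ref, pred in zip(reference, predicted):
        observed[ref - 1][pred - 1] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_levels))
                  for j in range(n_levels)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Penalty grows with the squared distance between levels.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row_totals[i] * col_totals[j] / n

    if expected_disagreement == 0:
        return float("nan")  # undefined when both raters use a single level
    return 1.0 - observed_disagreement / expected_disagreement


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width bins (bin count is an assumption)."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if ok else 0.0
        count[b] += 1

    # Weighted average gap between accuracy and mean confidence per bin.
    return sum(count[b] / n * abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
               for b in range(n_bins) if count[b] > 0)
```

In this sketch, a model that is "overconfident" in the abstract's sense would show mean verbalized confidence above its accuracy in most occupied bins, which raises the ECE.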

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine | 97 papers in training set | Top 0.3% | 17.8%
2. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.3% | 9.7%
3. Scientific Reports | 3102 papers in training set | Top 16% | 6.5%
4. JMIR Medical Informatics | 17 papers in training set | Top 0.2% | 6.1%
5. International Journal of Medical Informatics | 25 papers in training set | Top 0.3% | 4.7%
6. Frontiers in Digital Health | 20 papers in training set | Top 0.1% | 4.7%
7. Journal of Medical Internet Research | 85 papers in training set | Top 1% | 4.1%

50% of probability mass above

8. BMJ Health & Care Informatics | 13 papers in training set | Top 0.1% | 3.8%
9. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.7% | 3.8%
10. BMC Medical Research Methodology | 43 papers in training set | Top 0.3% | 3.4%
11. PLOS Digital Health | 91 papers in training set | Top 0.9% | 2.9%
12. PLOS ONE | 4510 papers in training set | Top 44% | 2.6%
13. JAMIA Open | 37 papers in training set | Top 0.6% | 2.3%
14. The Lancet Digital Health | 25 papers in training set | Top 0.3% | 2.0%
15. JMIR Public Health and Surveillance | 45 papers in training set | Top 2% | 1.6%
16. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.4% | 1.4%
17. Emergency Medicine Journal | 20 papers in training set | Top 0.3% | 1.4%
18. Healthcare | 16 papers in training set | Top 0.8% | 1.4%
19. JAMA Network Open | 127 papers in training set | Top 3% | 1.2%
20. Journal of Biomedical Informatics | 45 papers in training set | Top 1% | 1.1%
21. Computers in Biology and Medicine | 120 papers in training set | Top 4% | 0.9%
22. Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 5% | 0.9%
23. JMIR Formative Research | 32 papers in training set | Top 1% | 0.9%
24. Critical Care Explorations | 15 papers in training set | Top 0.4% | 0.9%
25. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 2% | 0.8%
26. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.7% | 0.8%
27. BMC Medicine | 163 papers in training set | Top 8% | 0.7%
28. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.9% | 0.7%
29. Journal of NeuroEngineering and Rehabilitation | 28 papers in training set | Top 1% | 0.7%
30. BMJ Open | 554 papers in training set | Top 13% | 0.6%
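
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked against the ranked percentages above. The short sketch below sums them in rank order and reports how many journals are needed to reach the threshold; the helper name is hypothetical, and the values are the rounded percentages as displayed for the ten highest-ranked journals, so the cumulative sums are approximate.

```python
from typing import Sequence


def journals_needed(probabilities_pct: Sequence[float], threshold_pct: float = 50.0) -> int:
    """Return how many top-ranked entries are needed to reach the threshold."""
    cumulative = 0.0
    for rank, p in enumerate(probabilities_pct, start=1):
        cumulative += p
        if cumulative >= threshold_pct:
            return rank
    raise ValueError("threshold not reached by the supplied probabilities")


# Rounded probabilities (%) for ranks 1 to 10, copied from the list.
top_ten_pct = [17.8, 9.7, 6.5, 6.1, 4.7, 4.7, 4.1, 3.8, 3.8, 3.4]

needed = journals_needed(top_ten_pct)
print(needed)
```

With these rounded values the first six journals sum to just under 50%, and the seventh carries the total past it, consistent with the page's marker.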