Context-Aware Emergency Department Triage Using Pairwise Comparisons and Bradley-Terry Aggregation

Jarrett, P.; Reeder, J.; McDonald, S.; Diercks, D.; Jamieson, A. R.

2026-03-17 · health informatics
DOI: 10.64898/2026.03.14.26348412 (medRxiv)
Structured Abstract

Objective: To evaluate a ranking approach for emergency department (ED) waiting room prioritization that uses pairwise clinical comparisons aggregated via a Bradley-Terry model, and to assess its cross-site stability without site-specific training.

Materials and Methods: Using the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset (118,385 ED visits, Site A), we defined a composite deterioration outcome (intensive care unit [ICU] admission, intubation, vasopressor, ventilation, or death within 6 hours) and evaluated 7 queue-ordering policies across 1,000 simulated shifts. The primary endpoint was Recall@5 for deteriorators; secondary endpoints included area under the receiver operating characteristic curve (AUROC) and simulated time-to-provider (TTP) metrics. External validation used MIMIC-IV-ED (425,087 visits, Site B) with 500 shifts. Methods reported per TRIPOD-LLM.

Results: On MC-MED, BT-LLM-Enriched (Bradley-Terry ranking with a large language model [LLM] judge, GPT-4.1, using full diagnoses and medications) exceeded the Emergency Severity Index (ESI) on the primary endpoint: Recall@5 0.587 vs. 0.491 (p<0.001). XGBoost achieved Recall@5 0.648 but required large site-specific labeled training data. On external validation, supervised model performance attenuated (XGBoost AUROC 0.892 to 0.807) while BT-LLM-Enriched remained stable (0.826 to 0.831); the two were statistically indistinguishable on external data.

Discussion: Under external validation, supervised model performance attenuated while zero-shot LLM ranking remained stable, suggesting cross-site stability without requiring site-specific training data.

Conclusion: Pairwise ranking with an LLM judge significantly outperforms ESI-based ordering and remains stable across sites without local training, matching supervised models on external data.
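The aggregation step the abstract describes — turning a set of pairwise "patient A is sicker than patient B" judgments into a single queue order — can be sketched with the standard minorization-maximization (MM) fit for the Bradley-Terry model. This is a generic illustration of the technique, not the paper's implementation; the function name and settings here are assumptions.

```python
import numpy as np

def fit_bradley_terry(n_items, comparisons, iters=500, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise outcomes via the
    classic MM update (Hunter, 2004): p_i <- W_i / sum_j n_ij/(p_i+p_j).

    comparisons: list of (winner, loser) index pairs, e.g. produced by
    an LLM judge asked which of two patients should be seen first.
    Assumes every item appears in at least one comparison.
    Returns strengths normalized to sum to 1; sorting descending by
    strength yields the aggregate ranking.
    """
    wins = np.zeros((n_items, n_items))           # wins[i, j]: times i beat j
    for w, l in comparisons:
        wins[w, l] += 1
    n = wins + wins.T                             # comparisons per pair
    p = np.ones(n_items) / n_items                # uniform initial strengths
    for _ in range(iters):
        denom = n / (p[:, None] + p[None, :])     # n_ij / (p_i + p_j)
        np.fill_diagonal(denom, 0.0)
        new_p = wins.sum(axis=1) / denom.sum(axis=1)
        new_p /= new_p.sum()
        if np.max(np.abs(new_p - p)) < tol:
            break
        p = new_p
    return p

# Toy queue of 3 patients: 0 usually judged sicker than 1, 1 than 2.
strengths = fit_bradley_terry(
    3, [(0, 1)] * 3 + [(1, 2)] * 3 + [(0, 2)] * 2 + [(2, 1)] * 1)
ranking = np.argsort(-strengths)                  # queue order, sickest first
```

A metric like the paper's Recall@5 would then simply be the fraction of true deteriorators appearing among the first five positions of `ranking` in each simulated shift.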

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|-----:|---------|-----------------------:|-----------:|------------:|
| 1 | npj Digital Medicine | 97 | Top 0.2% | 22.8% |
| 2 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 18.8% |
| 3 | Journal of Medical Internet Research | 85 | Top 0.4% | 10.2% |
| 4 | JMIR Medical Informatics | 17 | Top 0.1% | 6.4% |
| 5 | Scientific Reports | 3102 | Top 17% | 6.4% |
| 6 | BMC Medical Informatics and Decision Making | 39 | Top 0.6% | 4.4% |
| 7 | JAMIA Open | 37 | Top 0.6% | 2.5% |
| 8 | The Lancet Digital Health | 25 | Top 0.2% | 2.4% |
| 9 | BMJ Health & Care Informatics | 13 | Top 0.3% | 1.9% |
| 10 | Journal of Biomedical Informatics | 45 | Top 0.7% | 1.9% |
| 11 | PLOS ONE | 4510 | Top 51% | 1.8% |
| 12 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.7% |
| 13 | BMJ Open | 554 | Top 9% | 1.7% |
| 14 | JMIR Public Health and Surveillance | 45 | Top 2% | 1.5% |
| 15 | Critical Care Explorations | 15 | Top 0.3% | 1.2% |
| 16 | Nature Communications | 4913 | Top 58% | 1.0% |
| 17 | Emergency Medicine Journal | 20 | Top 0.5% | 0.9% |
| 18 | CMAJ Open | 12 | Top 0.2% | 0.8% |
| 19 | Healthcare | 16 | Top 2% | 0.8% |
| 20 | Frontiers in Digital Health | 20 | Top 1% | 0.8% |
| 21 | BMC Medical Research Methodology | 43 | Top 1% | 0.8% |
| 22 | PLOS Digital Health | 91 | Top 3% | 0.7% |
| 23 | Medicine | 30 | Top 3% | 0.7% |
| 24 | BMC Medicine | 163 | Top 9% | 0.5% |