Context-Aware Emergency Department Triage Using Pairwise Comparisons and Bradley-Terry Aggregation
Jarrett, P.; Reeder, J.; McDonald, S.; Diercks, D.; Jamieson, A. R.
STRUCTURED ABSTRACT

Objective: To evaluate a ranking approach for emergency department (ED) waiting room prioritization that uses pairwise clinical comparisons aggregated via a Bradley-Terry model, and to assess its cross-site stability without site-specific training.

Materials and Methods: Using the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset (118,385 ED visits, Site A), we defined a composite deterioration outcome (intensive care unit [ICU] admission, intubation, vasopressor, ventilation, or death within 6 hours) and evaluated 7 queue-ordering policies across 1,000 simulated shifts. The primary endpoint was Recall@5 for deteriorators; secondary endpoints included area under the receiver operating characteristic curve (AUROC) and simulated time-to-provider (TTP) metrics. External validation used MIMIC-IV-ED (425,087 visits, Site B) with 500 shifts. Methods are reported per TRIPOD-LLM.

Results: On MC-MED, BT-LLM-Enriched (Bradley-Terry ranking with a large language model [LLM] judge, GPT-4.1, using full diagnoses and medications) exceeded the Emergency Severity Index (ESI) on the primary endpoint: Recall@5 0.587 vs. 0.491 (p<0.001). XGBoost achieved Recall@5 0.648 but required large site-specific labeled training data. On external validation, supervised model performance attenuated (XGBoost AUROC 0.892 to 0.807) while BT-LLM-Enriched remained stable (0.826 to 0.831); the two were statistically indistinguishable on external data.

Discussion: Under external validation, supervised model performance attenuated while zero-shot LLM ranking remained stable, suggesting cross-site stability without requiring site-specific training data.

Conclusion: Pairwise ranking with an LLM judge significantly outperforms ESI-based ordering and remains stable across sites without local training, matching supervised models on external data.
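As an illustration of the two technical ingredients named in the abstract, the sketch below fits Bradley-Terry strengths from pairwise "which patient is sicker" judgments and scores the resulting queue with a Recall@k metric. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the minorization-maximization (MM) fitting routine, the `prior` pseudo-count, and the reading of Recall@k as "fraction of deteriorators placed in the first k queue positions" are all assumptions, since the abstract does not specify them.

```python
from typing import List, Sequence, Set, Tuple


def fit_bradley_terry(
    n_items: int,
    comparisons: Sequence[Tuple[int, int]],
    n_iter: int = 200,
    prior: float = 0.1,
) -> List[float]:
    """Estimate Bradley-Terry strengths with the classic MM update.

    comparisons holds (winner, loser) index pairs, e.g. one per judge call.
    prior is a hypothetical symmetric pseudo-count added to every pair so
    that items with no wins still receive a finite, positive strength.
    """
    if n_items < 2:
        return [1.0] * n_items

    # wins[i][j] = (smoothed) number of times item i beat item j.
    wins = [[0.0 if i == j else prior for j in range(n_items)]
            for i in range(n_items)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    strengths = [1.0] * n_items
    for _ in range(n_iter):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            # Sum over opponents of n_ij / (p_i + p_j), n_ij = games played.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n_items) if j != i
            )
            updated.append(total_wins / denom)
        # Strengths are only identified up to scale, so renormalize.
        scale = n_items / sum(updated)
        strengths = [s * scale for s in updated]
    return strengths


def rank_by_strength(strengths: Sequence[float]) -> List[int]:
    """Return item indices ordered from highest to lowest strength."""
    return sorted(range(len(strengths)), key=lambda i: -strengths[i])


def recall_at_k(ranking: Sequence[int], positives: Set[int], k: int) -> float:
    """Fraction of positive items that appear in the first k ranked slots."""
    if not positives:
        raise ValueError("recall is undefined without positive items")
    return len(set(ranking[:k]) & positives) / len(positives)


# Toy waiting room of four patients with a fully consistent set of judgments.
toy_comparisons = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
toy_strengths = fit_bradley_terry(4, toy_comparisons)
toy_queue = rank_by_strength(toy_strengths)
```

In this toy setup each tuple stands in for one judge decision; in the study described above those decisions would come from the LLM judge, and k would be 5. The pseudo-count is one simple way to keep the estimate well defined when a patient never wins a comparison; other regularizers would serve equally well.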
Matching journals
The top 3 journals account for 50% of the predicted probability mass.