Solving Emergency Department Triage with Small Language Models

Belski, V.; Lukina, K.

2026-05-05 health policy

10.64898/2026.05.04.26352355 medRxiv


Emergency department (ED) triage assigns each patient a five-level Emergency Severity Index (ESI) score that determines care priority. We investigate the feasibility of automating this process, comparing large commercial models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, MedGemma) against two alternatives: a purpose-built pipeline that combines a small extraction model with a deterministic clinical engine, and a 9B-parameter language model trained with structured chain-of-thought supervision and reinforcement learning. Off-the-shelf large models achieve only 45-55% exact ESI accuracy and are impractical for clinical deployment because of privacy constraints, cost, and latency. Our specialized BiomedBERT [4] pipeline achieves 88.9% exact accuracy and 97.2% adjacent accuracy (within ±1 ESI level) on a 50-case expert-labeled evaluation set, approaching nurse inter-rater agreement. A Qwen3.5-9B model [16] fine-tuned with chain-of-thought supervision achieves 75.0% exact and 97.2% adjacent accuracy on a 36-case narrative evaluation. Ongoing GRPO training [13] with a clinically asymmetric reward function and 2,776 ESI-1 narrative training cases (previously only 22 because of an extraction bug) shows a strong early reward signal. We document 37+ BERT experiments, multiple LLM training cycles, systematic data quality audits, and the specific engineering decisions that enabled progress, including the discovery that 71% of training labels for altered mental status were false positives.
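
The two metrics reported above, and the idea of a clinically asymmetric reward, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the penalty weights, and the choice to penalize under-triage twice as heavily as over-triage are assumptions made purely for illustration. The only facts taken from the abstract are that exact accuracy requires an identical ESI level, that adjacent accuracy allows a difference of ±1 level, and that the reward treats errors asymmetrically.

```python
from typing import Sequence


def esi_accuracy(predicted: Sequence[int], expert: Sequence[int]) -> tuple[float, float]:
    """Return (exact, adjacent) accuracy for ESI levels 1 (most urgent) to 5."""
    if len(predicted) != len(expert) or len(expert) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(expert)
    # Exact: predicted level equals the expert label.
    exact = sum(p == e for p, e in zip(predicted, expert)) / n
    # Adjacent: predicted level is within one ESI level of the expert label.
    adjacent = sum(abs(p - e) <= 1 for p, e in zip(predicted, expert)) / n
    return exact, adjacent


def asymmetric_reward(
    predicted: int,
    expert: int,
    under_penalty: float = 2.0,  # hypothetical weight, not from the paper
    over_penalty: float = 1.0,   # hypothetical weight, not from the paper
) -> float:
    """Illustrative reward: under-triage costs more than over-triage.

    A higher ESI number means lower urgency, so predicted > expert
    means the patient was rated less urgent than they should be.
    """
    diff = predicted - expert
    if diff == 0:
        return 1.0
    if diff > 0:
        return -under_penalty * diff  # under-triage
    return -over_penalty * (-diff)    # over-triage


if __name__ == "__main__":
    preds = [1, 2, 3, 5]
    labels = [1, 2, 4, 3]
    print(esi_accuracy(preds, labels))
    print(asymmetric_reward(3, 1), asymmetric_reward(1, 3))
```

In this sketch, a prediction that is two levels too low in urgency is penalized twice as much as one that is two levels too high, which is one simple way to encode the clinical intuition that missing a critical patient is worse than over-prioritizing a stable one.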

