
Scalable screening for emergency department missed opportunities for diagnosis using sequential eTriggers and large language models

Marks, C. M.; Gibney, S.; Stenson, B.; Sarma, D.; Gaudet, C.; Mombini, H.; Buckley, T.; Burke, L.; Shapiro, N. I.; Burstein, J. K.; Grossman, S. A.; Parab, A.; Janke, A. T.; Manrai, A.; Taylor, R. A.; Rosen, C. L.; Rodman, A.; Haimovich, A. D.

2025-10-07 emergency medicine
medRxiv, DOI: 10.1101/2025.10.06.25337201

Importance: Missed opportunities for diagnosis (MODs), sometimes termed diagnostic errors, are a major cause of patient morbidity and mortality in the emergency department (ED). EDs have employed eTriggers, rule-based case collections likely to have a higher-than-average error rate (e.g., 72-hour returns with admission), but their utility is limited by low error yields. Large language models (LLMs) offer new opportunities to identify MODs and contribute to both individual- and systems-level quality improvement.

Objective: To determine whether sequential screening of ED cases with eTriggers and an LLM can identify MODs more efficiently than eTriggers alone.

Design: Retrospective observational cohort study of ED encounters collected between March 2015 and June 2025.

Setting: 10 EDs (2 academic, 8 community) in a single US health system.

Participants: Emergency physicians reviewed and adjudicated random samples of cases identified by 3 previously validated eTriggers (72-hour return with admission, 10-day return with ICU admission, and floor-to-ICU escalation within 24 hours) using the SaferDx instrument. An ED physician also evaluated a novel hybrid eTrigger combining an LLM adjudicator with a rules engine for 9-day return admissions with emergency care-sensitive conditions (ECSCs).

Exposures: LLM MOD adjudication of ED cases with Claude Sonnet 4 using an iteratively developed, standardized prompt incorporating the SaferDx instrument.

Main Outcomes and Measures: Positive predictive value (PPV), sensitivity, specificity, negative predictive value (NPV), and number needed to screen (NNS) for MODs. Reviewer time to adjudicate cases and quality improvement stakeholder assessments of LLM case summaries were also measured.

Results: Of the 357 encounters (mean [SD] age, 65.2 [17.8] years; 47.1% female) reviewed, adjudicated MOD PPV ranged from 11.0% to 18.6% across traditional eTriggers. For 72-hour return admissions, the LLM achieved sensitivity 85.7% (95% CI, 65.4%-95.0%), specificity 56.8% (95% CI, 49.3%-64.0%), PPV 19.8%, and NPV 97.0%. For 10-day ICU returns, sensitivity was 100% (95% CI, 56.6%-100%), specificity 43.5% (95% CI, 25.6%-63.2%), PPV 27.8%, and NPV 100%. For floor-to-ICU escalations, sensitivity was 55.6% (95% CI, 33.7%-75.4%), specificity 64.6% (95% CI, 53.6%-74.2%), PPV 26.3%, and NPV 86.4%. The hybrid ECSC eTrigger identified 110 MODs (53.1% of 207 encounters), with blinded review of a stratified sample estimating PPV 45% and NPV 100%. Expert reviewers required a median of 5 minutes per case; restricting review to LLM-positive charts reduced review time by up to 50% without missed errors for these triggers. In stakeholder review, LLM-generated case summaries were rated highly actionable for individual clinician feedback (mean, 4.1 of 5) but less so for systems-level interventions (mean, 1.4 of 5).

Conclusions and Relevance: In this multisite retrospective study, LLMs demonstrated high NPVs across multiple eTrigger criteria. Sequential use of LLM and human review improved efficiency and detection compared with traditional eTriggers, and narrative case summaries offered a novel method to identify opportunities for clinician-level feedback. These findings suggest that LLM-based approaches may provide scalable diagnostic quality oversight in the ED.

Key Points

Question: Can sequential screening with eTriggers and a large language model (LLM) identify missed opportunities for diagnosis (MODs) in the emergency department, improving screening efficiency versus traditional eTriggers?

Findings: In a multicenter retrospective cohort (10 EDs; 317 reviewed encounters), LLM adjudication showed high sensitivity and NPV across three established eTriggers (e.g., 72-hour returns: sensitivity 85.7%, NPV 97.0%; 10-day ICU returns: sensitivity 100%, NPV 100%). A sequential approach was validated on a novel eTrigger for 9-day returns for select emergency care-sensitive conditions, achieving PPV 45% and NPV 100% in 40 blinded samples.

Meaning: LLM-augmented eTrigger screening offers scalable, efficient MOD detection to support diagnostic quality oversight in EDs.
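The abstract reports screening performance as sensitivity, specificity, PPV, NPV, and number needed to screen (NNS), all derived from a reviewer-adjudicated 2x2 confusion matrix over LLM-flagged cases. A minimal sketch of that arithmetic; the counts used below are hypothetical (the abstract does not give raw counts) and are chosen only to be arithmetically consistent with the reported 72-hour-return percentages:

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard screening metrics from a 2x2 confusion matrix.

    tp/fp: LLM-positive cases adjudicated as MOD / non-MOD
    fn/tn: LLM-negative cases adjudicated as MOD / non-MOD
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true MODs the LLM flags
        "specificity": tn / (tn + fp),  # fraction of non-MODs the LLM clears
        "ppv": tp / (tp + fp),          # MOD rate among flagged cases
        "npv": tn / (tn + fn),          # non-MOD rate among cleared cases
        "nns": total / tp,              # charts screened per true MOD found
    }

# Hypothetical counts consistent with the reported 72-hour-return metrics:
# sensitivity 85.7% (18/21), specificity 56.8% (96/169), PPV 19.8%, NPV 97.0%.
m = screening_metrics(tp=18, fp=73, fn=3, tn=96)
print({k: round(100 * v, 1) for k, v in m.items()})
```

The efficiency claim follows directly: restricting human review to LLM-positive charts means reviewers read only tp + fp cases (91 of 190 in this hypothetical), and the high NPV bounds how many MODs that shortcut can miss.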

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Journal of General Internal Medicine: 13.0% (20 papers in training set; top 0.1%)
2. Emergency Medicine Journal: 13.0% (20 papers in training set; top 0.1%)
3. Journal of the American Medical Informatics Association: 8.6% (61 papers in training set; top 0.3%)
4. CMAJ Open: 6.5% (12 papers in training set; top 0.1%)
5. PLOS ONE: 5.0% (4510 papers in training set; top 30%)
6. Journal of Medical Internet Research: 5.0% (85 papers in training set; top 0.9%)
50% of probability mass above
7. JAMA Network Open: 5.0% (127 papers in training set; top 0.5%)
8. npj Digital Medicine: 5.0% (97 papers in training set; top 0.9%)
9. PLOS Digital Health: 3.3% (91 papers in training set; top 0.8%)
10. Scientific Reports: 2.4% (3102 papers in training set; top 47%)
11. BMJ Open: 2.4% (554 papers in training set; top 7%)
12. Critical Care Explorations: 2.1% (15 papers in training set; top 0.2%)
13. JMIR Medical Informatics: 1.8% (17 papers in training set; top 0.6%)
14. BMC Health Services Research: 1.7% (42 papers in training set; top 1%)
15. International Journal of Medical Informatics: 1.7% (25 papers in training set; top 0.8%)
16. Journal of Biomedical Informatics: 1.7% (45 papers in training set; top 0.8%)
17. BMJ: 1.5% (49 papers in training set; top 0.6%)
18. JMIR Formative Research: 1.5% (32 papers in training set; top 0.9%)
19. The Journal of Infectious Diseases: 1.5% (182 papers in training set; top 3%)
20. Frontiers in Public Health: 1.3% (140 papers in training set; top 6%)
21. Genetics in Medicine: 0.8% (69 papers in training set; top 1.0%)
22. The Lancet Digital Health: 0.8% (25 papers in training set; top 1%)
23. PLOS Biology: 0.7% (408 papers in training set; top 20%)
24. Artificial Intelligence in Medicine: 0.5% (15 papers in training set; top 0.9%)
25. Proceedings of the National Academy of Sciences: 0.5% (2130 papers in training set; top 48%)
26. Nature Communications: 0.5% (4913 papers in training set; top 67%)