Cross-Model Variability in Large Language Model Triage Behavior for Potential Stroke Symptoms

Dworkis, D. A.; Stenstrom, J.; Sen, A.; Lucarelli, R. T.

2026-05-25 emergency medicine
10.64898/2026.05.22.26353904 medRxiv
Background: Stroke is a time-sensitive neurological emergency in which early EMS activation and presentation to definitive care are cornerstones of effective therapy. Large language models (LLMs) are increasingly consulted by the public for medical advice, but the veracity of the guidance that commercially available models provide in response to potential stroke symptoms is not well understood.

Methods: We performed a cross-model benchmarking study comparing the triage choices of three frontier LLMs (Claude Sonnet 4.6, GPT-4o, and Llama 3.3-70b-versatile) on first-person vignettes describing a unilateral arm symptom on waking, across 10 symptom descriptors and two clinical phases (before and after a partially reassuring self-examination), with or without a clinical distractor (n=50 per condition).

Results: Claude sought emergency care most often, Llama least, and GPT-4o in between. The models diverged most sharply in the post-examination phase, where Claude called 911 in 100% of runs, Llama called for non-emergency help in 100% of runs, and GPT-4o was symptom-dependent. A distractor shifted behavior away from emergency care in almost all conditions: in the post-examination vignette, calling 911 fell from 37.9% to 14.6% and waiting rose from 0% to 45.9%. Responses were also sensitive to the symptom word: weak, limp, heavy, and clumsy elicited more urgent responses, whereas numb, tingly, odd, strange, and weird elicited less urgent ones.

Conclusions: The increasing use of LLMs for medical advice has significant public health implications. Commercially available LLMs show substantial model-to-model variability and framing sensitivity when confronted with potential stroke symptoms, including under-recognition of canonical CDC warning descriptors. This underscores the need for systematic benchmarking as these tools become de facto first points of contact for patients experiencing neurological emergencies.
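The factorial design described in the Methods can be sketched as a condition grid. This is an illustrative sketch only, not the authors' code: it assumes the factors are fully crossed and that n=50 applies to each model, descriptor, phase, and distractor combination. The phase labels and variable names are hypothetical, and the abstract names only nine of the ten symptom descriptors, so the grid below is one descriptor short of the full study.

```python
from itertools import product

# Models named in the abstract.
models = ["Claude Sonnet 4.6", "GPT-4o", "Llama 3.3-70b-versatile"]

# Only nine of the ten descriptors are named in the abstract; the tenth is unknown here.
descriptors = ["weak", "limp", "heavy", "clumsy", "numb",
               "tingly", "odd", "strange", "weird"]

# Hypothetical labels for the two clinical phases (before/after self-examination).
phases = ["pre-examination", "post-examination"]

# Vignette presented without or with the clinical distractor.
distractor = [False, True]

# Assumption: n=50 repeated runs for every cell of the grid.
RUNS_PER_CONDITION = 50

# Fully crossed grid of conditions (an assumption about the design).
conditions = list(product(models, descriptors, phases, distractor))
total_runs = len(conditions) * RUNS_PER_CONDITION

print(len(conditions), "conditions,", total_runs, "runs")
```

With all ten descriptors the same construction would give 3 x 10 x 2 x 2 = 120 conditions under these assumptions.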

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1 | Frontiers in Neurology | 91 papers in training set | Top 0.1% | 23.2%
2 | PLOS ONE | 4510 papers in training set | Top 11% | 15.2%
3 | Stroke | 35 papers in training set | Top 0.1% | 13.1%
50% of probability mass above
4 | Emergency Medicine Journal | 20 papers in training set | Top 0.1% | 10.7%
5 | Scientific Reports | 3102 papers in training set | Top 25% | 4.7%
6 | Journal of the American Heart Association | 119 papers in training set | Top 2% | 3.8%
7 | JAMA Network Open | 127 papers in training set | Top 0.9% | 3.7%
8 | Journal of General Internal Medicine | 20 papers in training set | Top 0.4% | 1.9%
9 | CMAJ Open | 12 papers in training set | Top 0.1% | 1.7%
10 | Frontiers in Public Health | 140 papers in training set | Top 5% | 1.4%
11 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.4%
12 | Healthcare | 16 papers in training set | Top 1% | 1.3%
13 | BioData Mining | 15 papers in training set | Top 0.5% | 1.1%
14 | Stroke: Vascular and Interventional Neurology | 13 papers in training set | Top 0.3% | 1.0%
15 | npj Digital Medicine | 97 papers in training set | Top 3% | 0.9%
16 | Journal of Stroke and Cerebrovascular Diseases | 12 papers in training set | Top 0.4% | 0.9%
17 | Artificial Intelligence in Medicine | 15 papers in training set | Top 0.5% | 0.9%
18 | Heliyon | 146 papers in training set | Top 9% | 0.5%
19 | Journal of Medical Internet Research | 85 papers in training set | Top 5% | 0.5%
20 | Journal of NeuroEngineering and Rehabilitation | 28 papers in training set | Top 1% | 0.5%
21 | PLOS Digital Health | 91 papers in training set | Top 3% | 0.5%
22 | International Journal of Medical Informatics | 25 papers in training set | Top 2% | 0.5%
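
The "top 3 journals account for 50%" statement can be checked with a running total over the ranked probabilities. The sketch below is a minimal illustration, not the site's own method; the function name `journals_to_reach` is hypothetical, and only the first five probabilities from the list are copied in.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1 to 5, copied from the list above.
probabilities = [23.2, 15.2, 13.1, 10.7, 4.7]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach the threshold, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


# Ranks 1 and 2 sum to about 38.4%, so the third journal is the one that crosses 50%.
print(journals_to_reach(probabilities, 50.0))
```

The cumulative total after three journals is about 51.5%, so the 50% marker sits between ranks 3 and 4.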