
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

medRxiv preprint (neurology), posted 2026-04-23. DOI: 10.64898/2026.04.22.26351488

Background: Current evaluations of medical large language models (LLMs) rely largely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management.

Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse each case and provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated-evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures.

Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs included MS in the differential diagnosis in more than 91% of cases. However, diagnostic competence was not associated with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared with GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications such as active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (GPT 5.2: 9.6%; GPT 5 mini: 6.4%), compared with below 1% for the Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom-timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old.

Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.
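For orientation, the proportion confidence intervals quoted above are consistent with a simple normal-approximation (Wald) interval over n = 1,000 cases, e.g. 72/1,000 appropriate steroid recommendations for Gemini 3 Flash gives roughly 5.6% to 8.8%. The abstract does not state the exact CI method or per-model denominators, so the sketch below is illustrative only:

```python
import math

def wald_ci(k, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion k successes out of n trials."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)  # z * standard error of the proportion
    return p - half, p + half

# Hypothetical count: 72 appropriate steroid recommendations in 1,000 cases (7.2%)
lo, hi = wald_ci(72, 1000)
print(f"{lo:.3f} to {hi:.3f}")  # roughly 0.056 to 0.088
```

For small counts or proportions near 0 or 1, a Wilson score interval would be the more robust choice; the Wald form is shown only because it reproduces the reported bounds most closely.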

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 28.8% (97 papers in training set; top 0.1%)
2. Nature Medicine: 8.7% (117 papers in training set; top 0.1%)
3. PLOS Digital Health: 5.0% (91 papers in training set; top 0.4%)
4. Scientific Reports: 4.3% (3,102 papers in training set; top 27%)
5. Frontiers in Neurology: 3.4% (91 papers in training set; top 2%)
6. Journal of the American Medical Informatics Association: 3.2% (61 papers in training set; top 0.8%)
7. PLOS ONE: 2.8% (4,510 papers in training set; top 43%)
8. Journal of Neurology, Neurosurgery & Psychiatry: 2.2% (29 papers in training set; top 0.5%)
9. Med: 2.0% (38 papers in training set; top 0.2%)
10. Nature Communications: 1.8% (4,913 papers in training set; top 50%)
11. EClinicalMedicine: 1.8% (21 papers in training set; top 0.2%)
12. The Lancet Digital Health: 1.7% (25 papers in training set; top 0.4%)
13. Annals of Clinical and Translational Neurology: 1.4% (29 papers in training set; top 0.7%)
14. BMJ Health & Care Informatics: 1.4% (13 papers in training set; top 0.5%)
15. Brain Communications: 1.0% (147 papers in training set; top 2%)
16. Multiple Sclerosis Journal: 1.0% (18 papers in training set; top 0.2%)
17. Critical Care Explorations: 0.9% (15 papers in training set; top 0.4%)
18. iScience: 0.9% (1,063 papers in training set; top 25%)
19. Multiple Sclerosis and Related Disorders: 0.9% (15 papers in training set; top 0.2%)
20. Journal of Neurology: 0.9% (26 papers in training set; top 1.0%)
21. PLOS Computational Biology: 0.9% (1,633 papers in training set; top 21%)
22. Brain: 0.9% (154 papers in training set; top 4%)
23. BMC Neurology: 0.9% (12 papers in training set; top 0.7%)
24. Epilepsia: 0.8% (49 papers in training set; top 0.7%)
25. Proceedings of the National Academy of Sciences: 0.8% (2,130 papers in training set; top 42%)
26. Journal of the Neurological Sciences: 0.8% (17 papers in training set; top 0.7%)
27. Clinical Pharmacology & Therapeutics: 0.8% (25 papers in training set; top 0.7%)
28. Epilepsy Research: 0.8% (12 papers in training set; top 0.3%)
29. European Radiology: 0.8% (14 papers in training set; top 0.7%)
30. BMC Medicine: 0.8% (163 papers in training set; top 6%)