Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology
10.64898/2026.04.22.26351488 medRxiv

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management.

Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro, Gemini 3 Flash, GPT-5.2, and GPT-5 mini) were instructed to analyse the cases and provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures.

Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5-100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs included MS in the differential diagnoses for >91% of cases. However, diagnostic competence was not associated with treatment safety. Gemini 3 models had lower rates of clinically appropriate steroid recommendations (Flash: 7.2% [95% CI 5.6-8.8]; Pro: 15.8% [13.6-18.1]) than GPT-5 mini (23.5% [20.8-26.1]), frequently overlooking contraindications such as active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (GPT-5.2: 9.6%; GPT-5 mini: 6.4%), compared to <1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as >14 days old.

Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.

1-2 sentence description: By scaling an expert-validated simulation process to 10,000 cases, this study demonstrates that high diagnostic accuracy by AI can mask rare but dangerous safety failures. This large-scale approach provides a framework for uncovering clinical "blind spots" that small-scale evaluations miss, helping inform the development of safety guardrails before AI is deployed in practice.
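To make the reported intervals and the case for scale more concrete, the sketch below shows a normal-approximation (Wald) 95% interval for a proportion and the chance of seeing at least one rare failure in a sample of a given size. This is an illustration only: the abstract does not say which interval method the authors used, and the function names and the 0.1% failure rate are hypothetical, not taken from the study.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    Assumption: the study's actual interval method is not stated in the
    abstract; this is one common choice, shown for illustration.
    """
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def prob_at_least_one(rate, n):
    """Probability of observing at least one event in n independent cases."""
    return 1 - (1 - rate) ** n


# Reported rate from the abstract: 7.2% across 1,000 cases.
low, high = wald_interval(0.072, 1000)
print(f"95% CI: {low * 100:.1f}-{high * 100:.1f}")

# Hypothetical failure rate of 0.1% (not a figure from the study).
for n in (70, 1000, 10000):
    print(n, round(prob_at_least_one(0.001, n), 4))
```

For the 7.2% figure this approximation gives 5.6-8.8, in line with the reported interval; the other reported intervals differ from it in the last digit, so the authors may have used a different method. The second function shows why a failure that occurs once per thousand cases is likely to be missed in a 70-case review yet is almost certain to appear in 10,000 cases.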

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.

1 | npj Digital Medicine | 97 papers in training set | Top 0.1% | 56.3%
50% of probability mass above
2 | Nature Medicine | 117 papers in training set | Top 0.4% | 5.3%
3 | PLOS Digital Health | 91 papers in training set | Top 0.4% | 5.3%
4 | Scientific Reports | 3102 papers in training set | Top 48% | 2.3%
5 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.9%
6 | PLOS ONE | 4510 papers in training set | Top 57% | 1.4%
7 | Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.3%
8 | PLOS Computational Biology | 1633 papers in training set | Top 21% | 1.0%
9 | Med | 38 papers in training set | Top 0.5% | 1.0%
10 | iScience | 1063 papers in training set | Top 24% | 1.0%
11 | The Lancet Digital Health | 25 papers in training set | Top 0.9% | 0.9%
12 | BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.8%
13 | BioData Mining | 15 papers in training set | Top 0.7% | 0.8%
14 | Nature Communications | 4913 papers in training set | Top 61% | 0.8%
15 | International Journal of Medical Informatics | 25 papers in training set | Top 1% | 0.8%
16 | Journal of Clinical Epidemiology | 28 papers in training set | Top 0.5% | 0.8%
17 | Journal of Neurology, Neurosurgery & Psychiatry | 29 papers in training set | Top 1% | 0.8%
18 | Healthcare | 16 papers in training set | Top 2% | 0.8%
19 | Frontiers in Public Health | 140 papers in training set | Top 7% | 0.8%
20 | BMC Medicine | 163 papers in training set | Top 6% | 0.8%
21 | Acta Psychiatrica Scandinavica | 10 papers in training set | Top 0.3% | 0.8%
22 | Epilepsy Research | 12 papers in training set | Top 0.3% | 0.8%
23 | Epilepsia | 49 papers in training set | Top 0.7% | 0.8%
24 | BMJ Health & Care Informatics | 13 papers in training set | Top 1% | 0.7%
25 | Network Neuroscience | 116 papers in training set | Top 1% | 0.7%
26 | Journal of NeuroEngineering and Rehabilitation | 28 papers in training set | Top 1% | 0.5%
27 | Brain | 154 papers in training set | Top 5% | 0.5%
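
The statement that the top 1 journal accounts for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order and stopping at the first rank where the running total reaches 0.5. The sketch below does this for the first five listed values; the function name is hypothetical and this is not the site's own code.

```python
from itertools import accumulate


def journals_to_reach(probabilities, target=0.5):
    """Return the smallest rank k whose cumulative probability reaches target.

    Probabilities are expected in descending rank order. Returns None if
    the target is never reached.
    """
    for k, total in enumerate(accumulate(probabilities), start=1):
        if total >= target:
            return k
    return None


# Predicted probabilities for ranks 1-5 as listed above.
top_five = [0.563, 0.053, 0.053, 0.023, 0.019]
k = journals_to_reach(top_five)
print(k)
```

Because the first entry alone is 56.3%, the threshold is crossed at rank 1, which matches the marker placed after npj Digital Medicine.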