Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases
Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.
Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management.
Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT-5.2/5-mini) were instructed to analyse the cases and provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs with the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and for automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures.
Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5-100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs included MS in the differential diagnoses for >91% of cases. However, diagnostic competence was not associated with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2% [95% CI 5.6-8.8]; Pro: 15.8% [13.6-18.1]) compared with GPT-5-mini (23.5% [20.8-26.1]), frequently overlooking contraindications such as active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT-5.2; 6.4% GPT-5-mini) compared with <1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as >14 days old.
Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.
1-2 SENTENCE DESCRIPTION: By scaling an expert-validated simulation process to 10,000 cases, this study demonstrates that high diagnostic accuracy by AI can mask rare but dangerous safety failures. This large-scale approach provides a framework for uncovering clinical "blind spots" that small-scale evaluations miss, helping inform the development of safety guardrails before AI is deployed in practice.
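The abstract reports rates with 95% confidence intervals but does not say how the intervals were computed. As a rough illustration only, the sketch below shows two common ways to put an interval around a proportion: the normal-approximation (Wald) interval and the Wilson score interval, which tends to behave better for rare events such as a <1% failure rate. The function names and the count of 72 out of 1,000 (chosen to match the reported 7.2%) are assumptions for illustration, not figures or methods taken from the study.

```python
import math


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for a proportion.

    Hypothetical helper; the study's own interval method is not stated.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval, usually preferred when the rate is near 0 or 1."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Assumed counts: 72 of 1,000 cases corresponds to a 7.2% rate.
    low, high = wald_ci(72, 1000)
    print(f"Wald:   {100 * low:.1f}% to {100 * high:.1f}%")
    low, high = wilson_ci(72, 1000)
    print(f"Wilson: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumed counts the Wald interval comes out close to the reported 5.6-8.8 range, but the other reported intervals differ slightly from a Wald calculation, so the authors may have used a different method or different denominators.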
Matching journals
The top 1 journal accounts for 50% of the predicted probability mass.