Measuring the Unmeasurable: A Diagnostic Sensor for AI Reasoning Pathology in Sequential Clinical Decision-Making
Wang, S.
Large Language Models achieve impressive accuracy on medical benchmarks that present clinical information as complete vignettes, but their behavior under sequential information delivery (the standard mode of real clinical practice) is poorly characterized. We conduct a three-condition ablation study (N=50 NEJM-derived cases, 150 total runs) using claude-sonnet-4-20250514 to investigate what happens when diagnostic information arrives in stages rather than all at once. We introduce a novel 5+2 scoring rubric measuring seven dimensions of reasoning quality beyond binary accuracy, and a 6-code failure mode taxonomy enabling mechanistic root-cause analysis of diagnostic failures. We document Convergence Regression (CR): a systematic failure mode in which models correctly identify diagnoses at intermediate reasoning stages but abandon them when subsequent evidence triggers pattern-matching to alternative diagnoses. Under unstructured sequential delivery, models access the correct diagnosis in 90% of cases but retain it in only 60%, creating a 30-percentage-point Access-Stability Dissociation that is invisible under single-shot evaluation. A structured scaffold, the Sequential Information Prioritization Scaffold (SIPS), eliminates this gap entirely through forced hypothesis accountability: 80% access, 80% final accuracy, 0% Convergence Regression. We term this the SIPS Retention Effect. However, scaffolding reduces top-1 accuracy from 60% to 40%, a Convergence Hesitancy Paradox establishing that retention and convergence are architecturally distinct reasoning tasks requiring separate mechanisms. We propose that structured scaffolding functions as a diagnostic sensor for reasoning pathology rather than an accuracy intervention: it makes failure modes visible, classifiable, and auditable. We demonstrate that our measurement instruments operationalize WHO and FDA governance requirements for AI transparency, accountability, and safety into quantifiable scores.
We release the complete framework, including the 5+2 rubric, 6-code taxonomy, scaffold specification, and 210-score matrix with adjudication rationale, as a reusable audit instrument for evaluating LLM reasoning behavior in any sequential reasoning context. The study evolved across three analytical phases: an N=50 aggregate ablation establishing population-level scaffold effects; a stratified N=10 mechanistic case analysis characterizing the specific failure mode and its structural remedy; and an N=10 cross-model replication across three architecturally distinct LLMs (Claude Sonnet 4, GPT-4o, Llama 3.3-70B) testing generalizability. A subsequent multi-model validation study confirms that core C3 process properties, Hypothesis Tracking universality (5.0/5.0) and Step Adherence (4.9-5.0), replicate across GPT-4o and Llama 3.3-70B under identical protocols. The Convergence Hesitancy Paradox, while present in GPT-4o, is absent in Claude and Llama, establishing that the scaffold measures model-specific reasoning profiles rather than imposing a single fixed performance trade-off.