
Measuring the Unmeasurable: A Diagnostic Sensor for AI Reasoning Pathology in Sequential Clinical Decision-Making

Wang, S.

medRxiv preprint (health informatics), 2026-03-30
doi: 10.64898/2026.03.27.26349583

Large Language Models achieve impressive accuracy on medical benchmarks that present clinical information as complete vignettes, but their behavior under sequential information delivery, the standard mode of real clinical practice, is poorly characterized. We conduct a three-condition ablation study (N=50 NEJM-derived cases, 150 total runs) using claude-sonnet-4-20250514 to investigate what happens when diagnostic information arrives in stages rather than all at once. We introduce a novel 5+2 scoring rubric measuring seven dimensions of reasoning quality beyond binary accuracy, and a 6-code failure mode taxonomy enabling mechanistic root-cause analysis of diagnostic failures. We document Convergence Regression (CR): a systematic failure mode where models correctly identify diagnoses at intermediate reasoning stages but abandon them when subsequent evidence triggers pattern-matching to alternative diagnoses. Under unstructured sequential delivery, models access the correct diagnosis in 90% of cases but retain it in only 60%, creating a 30-percentage-point Access-Stability Dissociation invisible under single-shot evaluation. A structured scaffold, the Sequential Information Prioritization Scaffold (SIPS), eliminates this gap entirely through forced hypothesis accountability: 80% access, 80% final accuracy, 0% Convergence Regression. We term this the SIPS Retention Effect. However, scaffolding reduces top-1 accuracy from 60% to 40%, a Convergence Hesitancy Paradox establishing that retention and convergence are architecturally distinct reasoning tasks requiring separate mechanisms. We propose that structured scaffolding functions as a diagnostic sensor for reasoning pathology rather than an accuracy intervention: it makes failure modes visible, classifiable, and auditable. We demonstrate that our measurement instruments operationalize WHO and FDA governance requirements for AI transparency, accountability, and safety into quantifiable scores.
We release the complete framework, including the 5+2 rubric, 6-code taxonomy, scaffold specification, and 210-score matrix with adjudication rationale, as a reusable audit instrument for evaluating LLM reasoning behavior in any sequential reasoning context. The study evolved across three analytical phases: N=50 aggregate ablation establishing population-level scaffold effects; stratified N=10 mechanistic case analysis characterizing the specific failure mode and its structural remedy; and N=10 cross-model replication across three architecturally distinct LLMs (Claude Sonnet 4, GPT-4o, Llama 3.3-70B) testing generalizability. A subsequent multi-model validation study confirms that core C3 process properties, Hypothesis Tracking universality (5.0/5.0) and Step Adherence (4.9-5.0), replicate across GPT-4o and Llama 3.3-70B under identical protocols. The Convergence Hesitancy Paradox, while present in GPT-4o, is absent in Claude and Llama, establishing that the scaffold measures model-specific reasoning profiles rather than imposing a single fixed performance trade-off.
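The Access-Stability Dissociation described in the abstract is the gap between how often a model ever surfaces the correct diagnosis and how often it keeps it to the end. A minimal sketch of that computation, using toy data and hypothetical field names (not the study's released scoring matrix):

```python
# Toy illustration of the Access-Stability Dissociation metric.
# Case records and field names here are hypothetical, not the paper's format.
def dissociation(cases):
    """cases: list of dicts with two booleans per case:
    'accessed' -- correct diagnosis appeared at any intermediate stage
    'retained' -- correct diagnosis survived to the final answer
    Returns (access rate, retention rate, dissociation gap)."""
    n = len(cases)
    access = sum(c["accessed"] for c in cases) / n
    retention = sum(c["retained"] for c in cases) / n
    return access, retention, access - retention

# 10 toy cases: 9 access the diagnosis, 6 retain it
# (mirroring the 90% / 60% pattern reported in the abstract).
toy = [{"accessed": i < 9, "retained": i < 6} for i in range(10)]
acc, ret, gap = dissociation(toy)
print(acc, ret, round(gap, 2))  # 0.9 0.6 0.3
```

The point of the metric is that single-shot evaluation only observes `retained`; the gap is invisible unless intermediate stages are scored separately.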

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 19.1% (97 papers in training set; Top 0.2%)
2. Nature Communications: 7.0% (4913 papers in training set; Top 25%)
3. Nature Medicine: 5.0% (117 papers in training set; Top 0.4%)
4. eLife: 5.0% (5422 papers in training set; Top 16%)
5. Philosophical Transactions of the Royal Society B: 4.4% (51 papers in training set; Top 0.9%)
6. Journal of the American Medical Informatics Association: 4.4% (61 papers in training set; Top 0.6%)
7. PLOS ONE: 4.1% (4510 papers in training set; Top 35%)
8. Scientific Reports: 3.7% (3102 papers in training set; Top 34%)

[50% of the predicted probability mass is reached above this line.]

9. Nature: 3.7% (575 papers in training set; Top 6%)
10. PLOS Digital Health: 2.8% (91 papers in training set; Top 0.9%)
11. Cell Systems: 2.8% (167 papers in training set; Top 5%)
12. iScience: 2.1% (1063 papers in training set; Top 9%)
13. Nature Machine Intelligence: 1.9% (61 papers in training set; Top 2%)
14. GENETICS: 1.9% (189 papers in training set; Top 0.5%)
15. JCO Clinical Cancer Informatics: 1.8% (18 papers in training set; Top 0.4%)
16. Patterns: 1.5% (70 papers in training set; Top 1%)
17. Annals of Internal Medicine: 1.4% (27 papers in training set; Top 0.5%)
18. Nature Human Behaviour: 1.3% (85 papers in training set; Top 3%)
19. Med: 1.3% (38 papers in training set; Top 0.4%)
20. Communications Medicine: 1.1% (85 papers in training set; Top 0.5%)
21. Bioinformatics: 1.0% (1061 papers in training set; Top 8%)
22. The Lancet Digital Health: 0.9% (25 papers in training set; Top 0.8%)
23. Communications Psychology: 0.9% (20 papers in training set; Top 0.2%)
24. Nature Computational Science: 0.8% (50 papers in training set; Top 1%)
25. Nature Biomedical Engineering: 0.8% (42 papers in training set; Top 2%)
26. Nature Methods: 0.8% (336 papers in training set; Top 6%)
27. Neuron: 0.8% (282 papers in training set; Top 8%)
28. Cell: 0.8% (370 papers in training set; Top 16%)
29. BMJ Health & Care Informatics: 0.7% (13 papers in training set; Top 0.9%)
30. Nature Biotechnology: 0.5% (147 papers in training set; Top 9%)
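The "top 8 journals account for 50% of the predicted probability mass" statement is a simple cumulative-sum cutoff over the listed percentages. A quick check using the top-10 values from the list above:

```python
# Find the first rank at which cumulative predicted probability reaches 50%.
# Percentages are copied from the matching-journals list above (top 10 shown).
def cutoff_rank(probs, threshold=50.0):
    cum = 0.0
    for rank, p in enumerate(probs, start=1):
        cum += p
        if cum >= threshold:
            return rank, round(cum, 1)
    return None  # threshold never reached within the listed entries

probs = [19.1, 7.0, 5.0, 5.0, 4.4, 4.4, 4.1, 3.7, 3.7, 2.8]
rank, mass = cutoff_rank(probs)
print(rank, mass)  # 8 52.7
```

The top 7 journals sum to 49.0%, just under the threshold, so the 8th entry (Scientific Reports, 3.7%) is what pushes the cumulative mass past 50%.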