
Benchmarking Clinical Reasoning in Large Language Models: A Comparative Assessment Study

Prade, T.; Samwald, M.

2026-03-15 health informatics
10.64898/2026.03.13.26347597 medRxiv

Evaluation of Large Language Models (LLMs) and their clinical competence has mainly focused on conventional multiple-choice question (MCQ) medical exams, yielding benchmarks like MedQA-USMLE, where models have already exceeded expert-level performance. However, alternative assessment methods have recently been proposed, such as SCT-Bench, which is based on Script Concordance Testing (SCT) and evaluates clinical reasoning and probabilistic thinking under uncertainty. Reasoning-optimized models have unexpectedly scored worse on SCT-Bench despite outperforming non-reasoning models on other medical benchmarks. This study compared performance metrics, uncertainty proxies and clinical reasoning qualities between MedQA-USMLE and the public subset of SCT-Bench using instruction-tuned GPT-4.1, contrasting baseline and Chain-of-Thought (CoT) prompting across sampled responses. CoT prompts were designed to explicitly instruct the model to apply cognitive clinical reasoning strategies, and the use of these strategies was subsequently evaluated across both benchmark formats. CoT prompting improved MedQA performance from 86.4% to 93.0%, while the SCT-Bench score showed a non-significant decline from 77.7% to 74.7%. GPT-4.1 systematically overestimated the impact of new information under CoT, leading to overconfidence and more extreme ratings on SCT questions. Sample-based majority voting significantly improved MedQA scores under CoT but had no meaningful effect on SCT-Bench. Response entropy analysis showed that CoT increased overall answer variability while simultaneously clustering correct responses on MedQA, an effect absent on SCT-Bench. Calibration and ROC were substantially poorer on SCT-Bench than on MedQA, though CoT improved both on each benchmark.
Qualitative analysis confirmed GPT-4.1 could apply situation-appropriate reasoning strategies and showed signs of metacognitive awareness about its own reasoning process, with expert rating patterns suggesting possible alignment with expert-like logic. These findings further corroborate limitations in elicited clinical reasoning for SCT-based benchmarking and suggest that reasoning-aware evaluation frameworks could contribute meaningfully to the medical AI benchmark landscape.
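
As a rough illustration of two sample-based measures named in the abstract, the sketch below computes a majority vote and a response entropy over repeated answers to a single item. The abstract does not give the study's exact procedure, so the function names, the use of Shannon entropy in bits, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def majority_vote(samples):
    """Return the most frequent answer among sampled responses.

    Ties resolve to the answer seen first (Counter.most_common behaviour).
    """
    return Counter(samples).most_common(1)[0][0]


def response_entropy(samples):
    """Shannon entropy in bits of the empirical answer distribution.

    0.0 means every sample agreed; higher values mean more variable answers.
    """
    counts = Counter(samples)
    n = len(samples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


# Hypothetical sampled answers to one multiple-choice item (not study data).
samples = ["B", "B", "C", "B", "A"]
voted = majority_vote(samples)
entropy = response_entropy(samples)
print(voted, round(entropy, 3))
```

Under this reading, a low entropy combined with a correct vote corresponds to samples clustering on the right answer, which is the pattern the abstract reports for MedQA under CoT.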

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1 | Scientific Reports | 3102 papers in training set | Top 3% | 14.2%
2 | Computers in Biology and Medicine | 120 papers in training set | Top 0.1% | 10.0%
3 | npj Digital Medicine | 97 papers in training set | Top 0.7% | 8.1%
4 | PLOS Digital Health | 91 papers in training set | Top 0.5% | 4.8%
5 | PLOS ONE | 4510 papers in training set | Top 34% | 4.1%
6 | Biology Methods and Protocols | 53 papers in training set | Top 0.2% | 3.9%
7 | iScience | 1063 papers in training set | Top 5% | 3.5%
8 | Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 1% | 3.5%
50% of probability mass above
9 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.8% | 3.5%
10 | Frontiers in Digital Health | 20 papers in training set | Top 0.4% | 2.6%
11 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 1% | 2.1%
12 | BMJ Health & Care Informatics | 13 papers in training set | Top 0.3% | 2.1%
13 | Artificial Intelligence in Medicine | 15 papers in training set | Top 0.2% | 2.1%
14 | Healthcare | 16 papers in training set | Top 0.4% | 2.1%
15 | International Journal of Medical Informatics | 25 papers in training set | Top 0.7% | 1.9%
16 | Journal of Personalized Medicine | 28 papers in training set | Top 0.4% | 1.7%
17 | BMC Bioinformatics | 383 papers in training set | Top 5% | 1.6%
18 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.5% | 1.6%
19 | GigaScience | 172 papers in training set | Top 2% | 1.6%
20 | Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.5%
21 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.3%
22 | Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.6% | 1.2%
23 | Bioinformatics | 1061 papers in training set | Top 9% | 0.9%
24 | Data in Brief | 13 papers in training set | Top 0.3% | 0.9%
25 | Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.6% | 0.9%
26 | Royal Society Open Science | 193 papers in training set | Top 4% | 0.8%
27 | PLOS Computational Biology | 1633 papers in training set | Top 24% | 0.8%
28 | JMIR Medical Informatics | 17 papers in training set | Top 1% | 0.8%
29 | Heliyon | 146 papers in training set | Top 6% | 0.8%
30 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 45% | 0.7%
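
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages. The sketch below is only that arithmetic; the helper name `journals_to_reach` is hypothetical, and because the listed values are rounded to one decimal place the running total is approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) of the ten top-ranked journals, copied from the list above.
probs = [14.2, 10.0, 8.1, 4.8, 4.1, 3.9, 3.5, 3.5, 3.5, 2.6]


def journals_to_reach(probs, threshold=50.0):
    """Return (k, running_total) for the smallest k whose cumulative sum reaches threshold.

    Returns None if the threshold is never reached.
    """
    for k, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return k, total
    return None


k, total = journals_to_reach(probs)
print(k, round(total, 1))
```

With these values the first seven journals sum to about 48.6% and the eighth brings the total to about 52.1%, which is consistent with the marker placed after rank 8.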