Back

MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics
10.64898/2026.04.14.26350711 medRxiv
Show abstract

MedSafe-Dx (v0), introduces a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Eleven models were evaluated and revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1
npj Digital Medicine
97 papers in training set
Top 0.1%
28.4%
2
Scientific Reports
3102 papers in training set
Top 9%
8.6%
3
Journal of the American Medical Informatics Association
61 papers in training set
Top 0.4%
7.0%
4
JCO Clinical Cancer Informatics
18 papers in training set
Top 0.2%
4.4%
5
Journal of Biomedical Informatics
45 papers in training set
Top 0.5%
3.3%
50% of probability mass above
6
Bioinformatics
1061 papers in training set
Top 6%
2.8%
7
BMC Medical Informatics and Decision Making
39 papers in training set
Top 1%
2.7%
8
PLOS ONE
4510 papers in training set
Top 46%
2.4%
9
Artificial Intelligence in Medicine
15 papers in training set
Top 0.2%
2.4%
10
iScience
1063 papers in training set
Top 9%
2.2%
11
Frontiers in Digital Health
20 papers in training set
Top 0.4%
2.1%
12
PLOS Digital Health
91 papers in training set
Top 1%
1.9%
13
Nature Communications
4913 papers in training set
Top 48%
1.9%
14
IEEE Journal of Biomedical and Health Informatics
34 papers in training set
Top 0.9%
1.8%
15
Patterns
70 papers in training set
Top 0.8%
1.7%
16
BMC Bioinformatics
383 papers in training set
Top 4%
1.7%
17
International Journal of Medical Informatics
25 papers in training set
Top 0.9%
1.5%
18
GigaScience
172 papers in training set
Top 2%
1.3%
19
Nature Machine Intelligence
61 papers in training set
Top 3%
1.0%
20
Journal of Medical Internet Research
85 papers in training set
Top 4%
0.9%
21
JMIR Medical Informatics
17 papers in training set
Top 1%
0.9%
22
Nature Medicine
117 papers in training set
Top 4%
0.9%
23
The Lancet Digital Health
25 papers in training set
Top 0.8%
0.9%
24
Computer Methods and Programs in Biomedicine
27 papers in training set
Top 0.8%
0.8%
25
PLOS Computational Biology
1633 papers in training set
Top 23%
0.8%
26
BMJ Health & Care Informatics
13 papers in training set
Top 0.9%
0.8%
27
JAMIA Open
37 papers in training set
Top 2%
0.7%
28
Computers in Biology and Medicine
120 papers in training set
Top 5%
0.7%
29
Frontiers in Artificial Intelligence
18 papers in training set
Top 0.9%
0.7%
30
Proceedings of the National Academy of Sciences
2130 papers in training set
Top 48%
0.5%