Failure detection in medical image classification under realistic distribution shifts: A large-scale benchmark

Steinmetz, P.; Frouin, F.; Morard, V.; Buvat, I.

2026-05-05 radiology and imaging
10.64898/2026.05.04.26350496 medRxiv

Medical images (MI) exhibit variability due to different acquisition protocols, devices, and patient populations, making failure detection at inference time essential for reliable deployment of clinical classifiers. Because existing evaluations of failure detection methods use different settings, it is difficult to compare results and identify the best strategy, if any. We present a comprehensive benchmark of eight confidence scoring functions and two score-aggregation strategies across eight MI tasks spanning diverse modalities, backbone architectures, training setups, and failure sources. Confidence ranking ability and classification error mitigation are jointly evaluated. While no single method systematically dominated across settings, aggregation of confidence scores consistently matched or approached the best individual method and substantially reduced the silent failure rate. Failure detection performance was strongly correlated with classifier accuracy in all tested settings. These findings provide large-scale evidence regarding the strengths and limitations of confidence scoring strategies and offer actionable guidance for mitigating silent failures under realistic distribution shifts in MI.
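
To make the abstract's terms concrete, here is a minimal sketch of confidence scoring, score aggregation, and a silent failure rate. Everything in it is an assumption for illustration: the abstract does not name its eight scoring functions or its two aggregation strategies, so maximum softmax probability, negative entropy, and rank averaging are stand-ins, and the definition of silent failure used here (a misclassified sample whose confidence is at or above the acceptance threshold) may differ from the paper's.

```python
import numpy as np

def max_softmax(probs):
    # Assumed score 1: confidence is the highest predicted class probability.
    return probs.max(axis=1)

def neg_entropy(probs, eps=1e-12):
    # Assumed score 2: negative predictive entropy (higher means more confident).
    return (probs * np.log(probs + eps)).sum(axis=1)

def rank_average(scores):
    # Assumed aggregation: average the normalized ranks of each score,
    # which puts scores with different ranges on a common [0, 1] scale.
    ranks = [np.argsort(np.argsort(s)) / (len(s) - 1) for s in scores]
    return np.mean(ranks, axis=0)

def silent_failure_rate(conf, correct, threshold):
    # Assumed definition: fraction of all samples that are misclassified
    # yet accepted because their confidence is at or above the threshold.
    accepted = conf >= threshold
    return float(np.mean(accepted & ~correct))

# Toy softmax outputs for four samples and two classes (made-up numbers).
probs = np.array([[0.90, 0.10], [0.60, 0.40], [0.55, 0.45], [0.80, 0.20]])
labels = np.array([0, 1, 0, 0])
correct = probs.argmax(axis=1) == labels

agg = rank_average([max_softmax(probs), neg_entropy(probs)])
print(silent_failure_rate(agg, correct, threshold=0.0))  # accept everything
print(silent_failure_rate(agg, correct, threshold=0.5))  # reject low-confidence samples
```

In this toy case the single misclassified sample receives a low aggregated confidence, so raising the threshold turns it from a silent failure into a flagged one.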

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Predicted probability
1     Nature Machine Intelligence                      61                      Top 0.2%    10.1%
2     Scientific Reports                               3102                    Top 10%     8.4%
3     npj Digital Medicine                             97                      Top 0.7%    7.1%
4     IEEE Transactions on Medical Imaging             18                      Top 0.1%    6.3%
5     Nature Communications                            4913                    Top 33%     4.8%
6     Patterns                                         70                      Top 0.1%    4.2%
7     PLOS ONE                                         4510                    Top 36%     3.9%
8     PLOS Digital Health                              91                      Top 0.6%    3.9%
9     NeuroImage                                       813                     Top 3%      3.6%
(50% of probability mass above)
10    Expert Systems with Applications                 11                      Top 0.1%    2.6%
11    Nature Medicine                                  117                     Top 2%      2.1%
12    Medical Physics                                  14                      Top 0.3%    2.1%
13    Diagnostics                                      48                      Top 0.8%    1.9%
14    Human Brain Mapping                              295                     Top 3%      1.9%
15    Computers in Biology and Medicine                120                     Top 2%      1.8%
16    Journal of Medical Imaging                       11                      Top 0.1%    1.7%
17    PLOS Computational Biology                       1633                    Top 17%     1.7%
18    European Radiology                               14                      Top 0.4%    1.7%
19    eBioMedicine                                     130                     Top 2%      1.5%
20    Proceedings of the National Academy of Sciences  2130                    Top 35%     1.5%
21    Neurocomputing                                   13                      Top 0.3%    1.3%
22    Medical Image Analysis                           33                      Top 0.7%    1.3%
23    Imaging Neuroscience                             242                     Top 2%      1.3%
24    Communications Medicine                          85                      Top 0.5%    1.2%
25    Science Advances                                 1098                    Top 23%     1.2%
26    IEEE Transactions on Biomedical Engineering      38                      Top 0.7%    0.9%
27    Biomedical Optics Express                        84                      Top 0.9%    0.9%
28    Journal of Biomedical Informatics                45                      Top 1%      0.9%
29    Journal of Pathology Informatics                 13                      Top 0.3%    0.8%
30    Science Translational Medicine                   111                     Top 6%      0.8%
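
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below is only an illustration: the helper name `journals_to_reach` is hypothetical, and the values are copied from the first ten entries of the list above.

```python
from itertools import accumulate

# Predicted probabilities (%) for the ten highest-ranked journals in the list.
probs = [10.1, 8.4, 7.1, 6.3, 4.8, 4.2, 3.9, 3.9, 3.6, 2.6]

def journals_to_reach(probs, target=50.0):
    """Return the smallest k whose top-k probabilities sum to at least target."""
    for k, total in enumerate(accumulate(probs), start=1):
        if total >= target:
            return k
    return None  # target not reached within the supplied journals

k = journals_to_reach(probs)
print(k, round(sum(probs[:k]), 1))
```

With these values the running total is 48.7% after eight journals and 52.3% after nine, which matches the position of the 50% marker between ranks 9 and 10.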