Widespread use of invalid statistical tests in biomedical machine learning

Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.

2026-05-20 bioinformatics
10.64898/2026.05.17.724301 bioRxiv
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance, not only to benchmark algorithms but also to inform scientific applications such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 to 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
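
To make the fold-dependence issue concrete, the sketch below contrasts a naive paired t statistic with the Nadeau-Bengio corrected resampled t statistic, a well-known existing fold-dependence-aware correction. It is not the SHARP test, whose procedure the abstract does not spell out. The function names, the per-fold differences, and the sample sizes are all hypothetical and chosen only for illustration.

```python
import math
import statistics


def naive_paired_t(diffs):
    """Paired t statistic that treats fold-level differences as independent."""
    k = len(diffs)
    mean_d = statistics.fmean(diffs)
    var_d = statistics.variance(diffs)  # sample variance (k - 1 denominator)
    return mean_d / math.sqrt(var_d / k)


def corrected_resampled_t(diffs, n_train, n_test):
    """Nadeau-Bengio corrected t statistic.

    The variance term 1/k is replaced by (1/k + n_test/n_train) to account
    for the overlap between training sets across folds.
    """
    k = len(diffs)
    mean_d = statistics.fmean(diffs)
    var_d = statistics.variance(diffs)
    return mean_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)


# Hypothetical per-fold accuracy differences (model A minus model B)
# from 10-fold cross-validation on 1000 samples (900 train, 100 test per fold).
fold_diffs = [0.021, 0.008, 0.015, -0.004, 0.019,
              0.012, 0.003, 0.017, 0.010, 0.006]

t_naive = naive_paired_t(fold_diffs)
t_corrected = corrected_resampled_t(fold_diffs, n_train=900, n_test=100)

# Both statistics are referred to a t distribution with k - 1 degrees of freedom.
print(f"naive t = {t_naive:.2f}, corrected t = {t_corrected:.2f}, "
      f"df = {len(fold_diffs) - 1}")
```

For 10-fold cross-validation the correction shrinks the statistic by a fixed factor of sqrt(1 + 10/9), roughly 1.45, so a difference that looks significant under the naive test may not remain so. The correction itself rests on an assumed between-fold correlation, which is the kind of strong assumption the abstract says existing tests rely on.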

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Nature Machine Intelligence | 61 papers in training set | Top 0.1% | 12.3%
2. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 7% | 8.4%
3. Patterns | 70 papers in training set | Top 0.1% | 6.3%
4. eLife | 5422 papers in training set | Top 17% | 4.8%
5. Briefings in Bioinformatics | 326 papers in training set | Top 1% | 4.8%
6. Scientific Reports | 3102 papers in training set | Top 27% | 4.3%
7. BMC Bioinformatics | 383 papers in training set | Top 2% | 4.2%
8. Bioinformatics | 1061 papers in training set | Top 5% | 4.0%
9. PLOS Computational Biology | 1633 papers in training set | Top 9% | 4.0%
50% of probability mass above
10. Communications Biology | 886 papers in training set | Top 2% | 3.6%
11. Cell Systems | 167 papers in training set | Top 5% | 2.9%
12. Nature Communications | 4913 papers in training set | Top 45% | 2.6%
13. NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.9%
14. GigaScience | 172 papers in training set | Top 1% | 1.7%
15. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.7%
16. PLOS Biology | 408 papers in training set | Top 9% | 1.7%
17. Communications Medicine | 85 papers in training set | Top 0.3% | 1.5%
18. Bioinformatics Advances | 184 papers in training set | Top 3% | 1.3%
19. PLOS ONE | 4510 papers in training set | Top 60% | 1.2%
20. Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.1%
21. Genome Biology | 555 papers in training set | Top 6% | 0.9%
22. Nature Computational Science | 50 papers in training set | Top 1% | 0.9%
23. BioData Mining | 15 papers in training set | Top 0.8% | 0.8%
24. npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.8%
25. PeerJ | 261 papers in training set | Top 15% | 0.7%
26. Journal of Cheminformatics | 25 papers in training set | Top 0.6% | 0.7%
27. Nature Methods | 336 papers in training set | Top 7% | 0.7%
28. Scientific Data | 174 papers in training set | Top 3% | 0.6%
29. Science Advances | 1098 papers in training set | Top 33% | 0.6%