Widespread use of invalid statistical tests in biomedical machine learning
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.
Show abstract
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance - not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor [≥]15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
Matching journals
The top 9 journals account for 50% of the predicted probability mass.