Toward trustworthy clinical AI for obsessive-compulsive disorder: reliability, generalizability, and interpretability of a transformer model across the ENIGMA-OCD consortium
Pak, M.; Ryu, Y.; Bae, S.; Anticevic, A.; Costa, A. D.; Thorsen, A. L.; van der Straten, A. L.; Couto, B.; Vai, B.; Hansen, B.; Soriano-Mas, C.; Li, C.-s. R.; Vriend, C.; Lochner, C.; Pittenger, C.; Moreau, C. A.; Rodriguez-Manrique, D.; Vecchio, D.; Shimizu, E.; Stern, E. R.; Munoz-Moreno, E.; Nurmi, E. L.; Piras, F.; Colombo, F.; Piras, F.; Jaspers-Fayer, F.; Benedetti, F.; Venkatasubramanian, G.; Eng, G. K.; Simpson, H. B.; Ruan, H.; Hu, H.; van Marle, H. J. F.; Tomiyama, H.; Martinez-Zalacain, I.; Feusner, J.; Narayanaswamy, J. C.; Yun, J.-Y.; Sato, J. R.; Ipser, J.; Pariente, J. C.; Mench
Background. Studies applying machine learning to obsessive-compulsive disorder (OCD) typically report accuracy in homogeneous samples but rarely assess the model reliability, generalizability, and interpretability needed for clinical use.

Methods. We applied a transformer-based deep learning model, the Multi-Band Brain Net, to the ENIGMA-OCD cohort, the largest available resting-state functional magnetic resonance imaging (rs-fMRI) dataset in OCD, comprising 1,706 participants (869 cases with OCD, 837 controls) across 23 sites worldwide. We evaluated model reliability by measuring calibration, i.e., the model's ability to "know what it doesn't know". We assessed generalizability using leave-one-site-out validation to test performance on unseen sites with different scanners, acquisition protocols, and patient populations. Finally, we examined interpretability by analyzing model attention weights to identify the neural connectivity patterns that influence model predictions.

Results. The model achieved modest but competitive classification performance (AUROC = .653, SD = .039). Crucially, while large-scale pretraining on the UK Biobank (N = 40,783) did not boost accuracy, it significantly enhanced model calibration by reducing overconfident predictions. Leave-one-site-out validation showed a generalization gap across sites (AUROC = .427-.819). Pretraining did not close this gap but removed scanner-manufacturer bias. Finally, attention-based mapping identified biologically plausible patterns of widespread hypoconnectivity in OCD relative to healthy controls, particularly in low-frequency bands involving the default mode, salience, and somatomotor networks. These findings aligned with known OCD neurobiology.

Conclusions. This study provides a framework for developing more reliable and trustworthy clinical artificial intelligence for OCD.
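The two evaluation ideas described in the Methods, leave-one-site-out validation and calibration checking, can be sketched in a few lines. The sketch below is purely illustrative: the toy data, the site labels, and the `expected_calibration_error` helper are hypothetical stand-ins, not the authors' transformer model or the ENIGMA-OCD data.

```python
# Illustrative sketch (not the authors' pipeline): leave-one-site-out (LOSO)
# splitting plus an expected-calibration-error (ECE) style reliability check.
import random

def expected_calibration_error(labels, probs, n_bins=10):
    """Average |observed positive rate - mean predicted probability| over bins,
    weighted by bin occupancy (a common binary-calibration formulation)."""
    total, ece = len(labels), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if idx:
            acc = sum(labels[i] for i in idx) / len(idx)
            conf = sum(probs[i] for i in idx) / len(idx)
            ece += len(idx) / total * abs(acc - conf)
    return ece

def leave_one_site_out(n_samples, sites):
    """Yield (held_out_site, train_indices, test_indices), one fold per site."""
    for held_out in sorted(set(sites)):
        train = [i for i, s in enumerate(sites) if s != held_out]
        test = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train, test

# Toy data: 4 stand-in "sites" (the real study used 23), random labels, and
# probabilities loosely correlated with the labels.
random.seed(0)
n = 80
sites = [i % 4 for i in range(n)]
labels = [random.randint(0, 1) for _ in range(n)]
probs = [min(max(l * 0.6 + random.uniform(0.0, 0.4), 0.0), 1.0) for l in labels]

for site, train, test in leave_one_site_out(n, sites):
    ece = expected_calibration_error([labels[i] for i in test],
                                     [probs[i] for i in test])
    print(f"site {site}: n_test={len(test)}, ECE={ece:.3f}")
```

A per-site ECE computed this way would expose overconfident predictions on held-out sites, which is the reliability property the abstract reports pretraining improved.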