CREB: Consistent Reference External Batch Harmonization

Kharade, A.; Pan, Y.; Andreescu, C.; Karim, H. T.

2026-03-12 bioengineering

10.64898/2026.03.10.710874 bioRxiv


Abstract

Machine learning models using functional magnetic resonance imaging (fMRI) are becoming increasingly popular, and these models often rely on training data from multiple large, publicly available datasets. It is often necessary to harmonize these data across sites and sequences, and algorithms like ComBat are frequently applied to correct for these differences. This has been shown to improve model performance and generalizability. However, traditional ComBat requires harmonizing all data (train, validation, test, and any other unseen external test sets) simultaneously, which can cause data leakage and limits application to new unseen data. We introduce Consistent Reference External Batch (CREB) harmonization, a novel extension of ComBat that learns the prior distribution of site effects exclusively from a designated training set. This learned prior serves as a consistent, easily deployable reference point, and the empirical Bayes framework uses it to update the site effect for any new, external unseen data. This approach enables training, validation, and test sets to be harmonized separately, thereby preventing data leakage, ensuring the integrity of downstream analyses, and allowing application to new unseen data. CREB differs from traditional ComBat, which estimates every site's prior distribution at once and therefore cannot be applied to unseen data or to data from sites not included in the original dataset. We tested CREB with training data from 2846 participants (ages 18-97 years) across 9 different studies and test data from 1113 participants (ages 18-88 years) from 3 studies. We evaluated harmonization performance with functional connectivity and gray matter volume. We show that CREB can effectively harmonize the test data to the training data and has comparable performance to ComBat. CREB conducts this harmonization in a two-step procedure that prevents leakage and is deployable to new unseen data.
Finally, we tested whether CREB could similarly preserve biological variance (e.g., whether age associations were preserved after harmonization). We found that CREB, like ComBat, preserved age associations with both functional connectivity and gray matter volume measures. CREB provides an easily deployable, robust harmonization method to standardize data to a common reference distribution, making it uniquely suitable for training generalizable machine learning models.
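
The two-step idea described in the abstract (learn a reference from the training set, then update the site effect of unseen data against that reference) can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: it handles only additive (location) site shifts, ignores covariates and scale effects, standardizes with the overall training standard deviation, and pools the prior across all training sites and features. The function names `fit_reference` and `harmonize_new_site` and the dictionary keys are hypothetical. The posterior mean uses the standard normal-normal shrinkage formula found in ComBat-style empirical Bayes.

```python
import numpy as np


def fit_reference(X_train, sites_train):
    """Step 1 (training data only): learn standardization constants and a
    normal prior over additive site shifts.

    Assumption: one prior (mean, variance) pooled over all training sites
    and features; features are assumed to be non-constant.
    """
    X_train = np.asarray(X_train, dtype=float)
    sites_train = np.asarray(sites_train)

    grand_mean = X_train.mean(axis=0)
    pooled_std = X_train.std(axis=0, ddof=1)
    Z = (X_train - grand_mean) / pooled_std

    # Per-site, per-feature mean shift in standardized units.
    shifts = np.stack(
        [Z[sites_train == s].mean(axis=0) for s in np.unique(sites_train)]
    )
    return {
        "grand_mean": grand_mean,
        "pooled_std": pooled_std,
        "prior_mean": shifts.mean(),
        "prior_var": shifts.var(ddof=1),
    }


def harmonize_new_site(X_new, ref):
    """Step 2 (unseen site): shrink the observed site shift toward the
    training prior, then remove it. No training samples are needed here,
    only the stored reference."""
    X_new = np.asarray(X_new, dtype=float)
    n = X_new.shape[0]

    Z = (X_new - ref["grand_mean"]) / ref["pooled_std"]
    gamma_hat = Z.mean(axis=0)        # observed shift per feature
    sigma2 = Z.var(axis=0, ddof=1)    # within-site variance per feature

    # Normal-normal posterior mean of the site shift.
    gamma_star = (n * ref["prior_var"] * gamma_hat + sigma2 * ref["prior_mean"]) / (
        n * ref["prior_var"] + sigma2
    )
    # Remove the shift and map back to the original feature scale.
    return (Z - gamma_star) * ref["pooled_std"] + ref["grand_mean"]
```

In this sketch the reference returned by `fit_reference` is computed once and can then be passed to `harmonize_new_site` for validation, test, or external sites one at a time, so no held-out sample influences the learned prior.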
