Harmonising Structural Brain MRI from Multiple Sites with Limited Sample Sizes

Bhalerao, G. V.; Markiewicz, P.; Turnbull, J.; Thomas, D. L.; De Vita, E.; Parkes, L.; Thompson, G.; MacKewn, J.; Krokos, G.; Wimberley, C.; Hallett, W.; Su, L.; Malhotra, P.; Hoggard, N.; Taylor, J.-P.; Brooks, D.; Ritchie, C.; Wardlaw, J.; Matthews, P.; Aigbirho, F.; O'Brien, J.; Hammers, A.; Herholz, K.; Barkhof, F.; Miller, K.; Matthews, J.; Smith, S.; Griffanti, L.

2026-04-22 radiology and imaging
10.64898/2026.04.21.26351106 medRxiv

Harmonisation is widely used to mitigate site- and scanner-related batch variability in multisite neuroimaging studies and is particularly critical in longitudinal clinical trials, where detection of subtle biological or treatment-related changes depends on reliable measurement across scanners and timepoints. However, the effectiveness of harmonisation in small, heterogeneous clinical datasets remains insufficiently understood, particularly in relation to subject-level variability and consistency across acquisition settings, and its impact on both removal of technical variability and preservation of biological variation in pooled multisite analyses. We systematically evaluated a range of image-based and statistical harmonisation methods using a clinically realistic multisite, multiscanner structural T1-weighted (T1w) MRI test-retest dataset comprising three controlled acquisition scenarios: repeatability, intra-scanner reproducibility and inter-scanner reproducibility. Methods were applied under different batch specifications (site, scanner, or both) and performance was assessed within each scenario and in pooled data using a multi-metric framework capturing both technical and biological variability in volumetric imaging-derived phenotypes (IDPs) relevant to aging and dementia research. Across IDPs, variability before harmonisation was lowest in the repeatability scenario (median variability = 0.6 to 2.7%, rank consistency ρ ≥ 0.9), with modest increases under intra-scanner reproducibility (0.5 to 3.2%, ρ = 0.5 to 1.0) and substantially greater variability under inter-scanner reproducibility conditions (1.7 to 19.2%, ρ = -0.1 to 0.9). These results offer important information to consider for multisite study design, including sample size calculation in clinical trials.
Harmonisation performance was strongly context dependent, with clearer benefits emerging in inter-scanner scenarios, where both variability reduction and improvements in subject-level consistency were observed. In pooled data, approaches that explicitly modelled site as batch and accounted for repeated-measure structure showed greater consistency across IDPs in batch effect mitigation and more accurately reflected underlying biological variation. Our evaluation metrics enabled disentangling the removal of global batch effects while highlighting residual variability at the phenotype-specific or multivariate levels. These findings demonstrate that harmonisation cannot be treated as a one-size-fits-all solution and must be interpreted relative to the acquisition context, dataset structure, and downstream analytic goals. Multi-metric evaluation under realistic clinical constraints is essential to support reliable and translatable neuroimaging inference by ensuring appropriate correction of batch effects while preserving longitudinal biological signals and sensitivity to clinically meaningful change in multisite studies.

Matching journals

The top 2 journals account for 50% of the predicted probability mass.