Cross-Cohort Generalizability of Plasma Biomarker Machine Learning Models Reveals Calibration-Driven Degradation in Clinical Utility

Korni, A.; Zandi, E.

2026-04-13 neurology

10.64898/2026.04.09.26350514 medRxiv


Background: Plasma biomarkers demonstrate strong within-cohort performance for identifying cerebral amyloid pathology, but their real-world clinical utility depends on generalization across populations and assay platforms. The impact of cross-cohort deployment on clinically actionable metrics such as negative predictive value (NPV) remains poorly characterized.

Objective: To evaluate the performance and portability of plasma biomarker-based machine learning models for amyloid PET prediction across independent cohorts, with emphasis on calibration and clinically relevant predictive values.

Methods: Data from ADNI (n=885) and A4 (n=822) were analyzed. Machine learning models were trained within each cohort to predict amyloid PET status and continuous amyloid burden (centiloids). Performance was assessed using ROC AUC, accuracy, R², and RMSE. Cross-cohort generalizability was evaluated using bidirectional transfer without retraining. Calibration, predictive values, and decision curve analysis were used to assess clinical utility.

Results: Within-cohort discrimination was high (AUC up to 0.913 in ADNI and 0.870 in A4), with moderate performance for centiloid prediction (R² up to 0.628 and 0.535, respectively). Cross-cohort deployment resulted in modest attenuation of AUC (~4-7%) but substantially greater degradation in clinically actionable performance. NPV declined from 0.831 to 0.644 under ADNI→A4 transfer (~19 percentage points) despite preserved discrimination. Calibration analyses demonstrated systematic probability misestimation, and decision curve analysis showed reduced net clinical benefit. Biomarker distribution differences across cohorts were consistent with dataset shift.

Conclusion: Plasma biomarker models retain discrimination across cohorts but exhibit clinically meaningful degradation in predictive value under deployment. Calibration instability and prevalence differences critically affect NPV, highlighting the need for cross-cohort validation, calibration assessment, and assay harmonization before clinical implementation.
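
The dependence of NPV on prevalence noted in the conclusion follows directly from Bayes' rule, and a small sketch may help make it concrete. The sensitivity, specificity, and prevalence values below are hypothetical illustrations only; they are not taken from ADNI, A4, or the authors' models, and the function name `npv` is an assumption introduced here.

```python
def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value from test characteristics and prevalence.

    NPV = TN / (TN + FN), expressed per unit of population.
    """
    true_neg = specificity * (1.0 - prevalence)    # amyloid-negative, test-negative
    false_neg = (1.0 - sensitivity) * prevalence   # amyloid-positive, test-negative
    return true_neg / (true_neg + false_neg)


# Hypothetical operating point held fixed while prevalence varies,
# mimicking a classifier moved to a cohort with more amyloid-positive cases.
SENS, SPEC = 0.80, 0.80  # illustrative values, not from the paper

for prev in (0.20, 0.35, 0.50):
    print(f"prevalence={prev:.2f}  NPV={npv(SENS, SPEC, prev):.3f}")
```

With discrimination held constant, NPV falls as prevalence rises, which is one way a model can keep a similar AUC across cohorts while losing rule-out value. Miscalibration can add to this effect by shifting where a fixed probability threshold lands in the new cohort.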
