Combining Machine Learning with Cox models for identifying risk factors for incident post-menopausal breast cancer in the UK Biobank
Liu, X.; Collister, J. A.; Littlejohns, T. J.; Morelli, D.; Clifton, D. A.; Hunter, D. J.; Clifton, L.
Breast cancer is the most common cancer in women. A better understanding of risk factors plays a central role in disease prediction and prevention. We aimed to identify potential novel risk factors for breast cancer among post-menopausal women, with pre-specified interest in the role of polygenic risk scores (PRS) for risk prediction. We designed an analysis pipeline combining machine learning (ML) and classical statistical models, with emphasis on necessary statistical considerations (e.g. collinearity, missing data). Extreme gradient boosting (XGBoost) with Shapley additive explanations (SHAP) feature importance measures was used for risk factor discovery among ~1.7k features in 104,313 post-menopausal women from the UK Biobank cohort. Cox models were subsequently constructed for in-depth investigation. Both PRS were significant risk factors when fitted simultaneously in both ML and Cox models (p < 0.001). ML analyses identified 11 novel predictors (excluding the two PRS), of which five were confirmed by the Cox models: plasma urea (HR=0.95, 95% CI 0.92-0.98, p < 0.001) and plasma phosphate (HR=0.67, 95% CI 0.52-0.88, p = 0.003) were inversely associated with the risk of developing post-menopausal breast cancer, whereas basal metabolic rate (HR=1.15, 95% CI 1.08-1.22, p < 0.001), red blood cell count (HR=1.20, 95% CI 1.08-1.34, p = 0.001), and creatinine in urine (HR=1.05, 95% CI 1.01-1.09, p = 0.008) were positively associated. Adding the novel features to a simpler Cox model containing the PRS and the established risk factors gave a slight improvement in risk discrimination in our final Cox model (Harrell's C-index = 0.670 vs 0.665).
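For readers unfamiliar with the discrimination measure quoted in the abstract, the following is a minimal sketch of Harrell's C-index in plain Python. It is not the authors' code: the function name `harrell_c_index`, its argument names, and the toy data are illustrative assumptions, and pairs with tied follow-up times are simply skipped, which production implementations may handle differently.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs whose predicted risks are ordered
    consistently with their observed event times (illustrative sketch).

    times       -- follow-up time for each subject
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        if times[a] == times[b]:
            continue  # assumption: tied times are skipped in this sketch
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk had the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four subjects, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.5, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the reported change from 0.665 to 0.670 is described as slight.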
Matching journals
The top 6 journals account for 50% of the predicted probability mass.