
Combining Machine Learning with Cox models for identifying risk factors for incident post-menopausal breast cancer in the UK Biobank

Liu, X.; Collister, J. A.; Littlejohns, T. J.; Morelli, D.; Clifton, D. A.; Hunter, D. J.; Clifton, L.

2022-06-27 oncology
10.1101/2022.06.27.22276932 medRxiv

Breast cancer is the most common cancer in women. A better understanding of risk factors plays a central role in disease prediction and prevention. We aimed to identify potential novel risk factors for breast cancer among post-menopausal women, with pre-specified interest in the role of polygenic risk scores (PRS) for risk prediction. We designed an analysis pipeline combining machine learning (ML) and classical statistical models, with emphasis on necessary statistical considerations (e.g. collinearity, missing data). An extreme gradient boosting (XGBoost) machine with Shapley (SHAP) feature importance measures was used for risk factor discovery among ~1.7k features in 104,313 post-menopausal women from the UK Biobank cohort. Cox models were subsequently constructed for in-depth investigation. Both PRS were significant risk factors when fitted simultaneously in both ML and Cox models (p < 0.001). ML analyses identified 11 novel predictors (excluding the two PRS), of which five were confirmed by the Cox models: plasma urea (HR=0.95, 95% CI 0.92-0.98, p < 0.001) and plasma phosphate (HR=0.67, 95% CI 0.52-0.88, p = 0.003) were inversely associated with risk of developing post-menopausal breast cancer, whereas basal metabolic rate (HR=1.15, 95% CI 1.08-1.22, p < 0.001), red blood cell count (HR=1.20, 95% CI 1.08-1.34, p = 0.001), and creatinine in urine (HR=1.05, 95% CI 1.01-1.09, p = 0.008) were positively associated. Our final Cox model demonstrated a slight improvement in risk discrimination when novel features were added to a simpler Cox model containing PRS and the established risk factors (Harrell's C-index = 0.670 vs 0.665).
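
The abstract compares models by Harrell's C-index, which measures how often a model ranks the higher-risk participant of a comparable pair correctly. The following is a minimal pure-Python sketch of that statistic, not the authors' code: the function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs in which the higher risk score
    belongs to the participant with the shorter time to event."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed
        # event; tied times are skipped (simplifying assumption).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up toy data: follow-up time, event indicator (1 = diagnosed,
# 0 = censored), and a model's predicted risk score.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.7, 0.5, 0.1]

c_index = harrell_c_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.670 vs 0.665 in context as a small gain in discrimination.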

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. eLife | 5422 papers in training set | Top 6% | 9.9%
2. Nature Communications | 4913 papers in training set | Top 19% | 9.9%
3. npj Breast Cancer | 18 papers in training set | Top 0.1% | 9.9%
4. Breast Cancer Research | 32 papers in training set | Top 0.1% | 8.3%
5. JNCI Cancer Spectrum | 10 papers in training set | Top 0.1% | 6.3%
6. Scientific Reports | 3102 papers in training set | Top 19% | 6.3%
50% of probability mass above
7. The Journal of Clinical Endocrinology & Metabolism | 35 papers in training set | Top 0.4% | 3.5%
8. Frontiers in Genetics | 197 papers in training set | Top 3% | 2.7%
9. Cancer Epidemiology, Biomarkers & Prevention | 17 papers in training set | Top 0.2% | 2.6%
10. European Journal of Cancer | 10 papers in training set | Top 0.1% | 2.0%
11. Annals of Oncology | 13 papers in training set | Top 0.4% | 2.0%
12. JCO Precision Oncology | 14 papers in training set | Top 0.2% | 1.7%
13. Frontiers in Oncology | 95 papers in training set | Top 2% | 1.7%
14. International Journal of Cancer | 42 papers in training set | Top 0.7% | 1.7%
15. iScience | 1063 papers in training set | Top 17% | 1.6%
16. Cancers | 200 papers in training set | Top 3% | 1.5%
17. BMC Cancer | 52 papers in training set | Top 2% | 1.3%
18. Journal of Medical Genetics | 28 papers in training set | Top 0.4% | 1.3%
19. PLOS Computational Biology | 1633 papers in training set | Top 20% | 1.2%
20. British Journal of Cancer | 42 papers in training set | Top 1% | 0.9%
21. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.7% | 0.9%
22. Communications Biology | 886 papers in training set | Top 22% | 0.8%
23. International Journal of Epidemiology | 74 papers in training set | Top 2% | 0.8%
24. The American Journal of Human Genetics | 206 papers in training set | Top 4% | 0.8%
25. Cancer Medicine | 24 papers in training set | Top 1% | 0.8%
26. Communications Medicine | 85 papers in training set | Top 1% | 0.7%
27. Metabolites | 50 papers in training set | Top 1% | 0.7%
28. PLOS ONE | 4510 papers in training set | Top 69% | 0.7%
29. EMBO Molecular Medicine | 85 papers in training set | Top 5% | 0.7%
30. Cell Reports Medicine | 140 papers in training set | Top 8% | 0.7%
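
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked from the listed percentages. The sketch below is only an illustration: the helper name `journals_to_reach` is hypothetical, the values are the rounded percentages shown for ranks 1 to 7, and it assumes they are given in rank order.

```python
def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative) for the smallest number of top-ranked
    journals whose summed probability reaches the threshold."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None  # threshold never reached


# Rounded predicted probabilities (%) for ranks 1-7 as listed above.
top_probabilities = [9.9, 9.9, 9.9, 8.3, 6.3, 6.3, 3.5]

count, cumulative = journals_to_reach(top_probabilities, 50.0)
print(count, round(cumulative, 1))
```

With these rounded values the threshold is crossed at the sixth journal, in line with the position of the "50% of probability mass above" marker.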