Interpretable Lifestyle-Based Machine Learning Models for Ten-Year Cardiovascular Risk Prediction using data from the UK Biobank

Feng, Y.; Kunz, H.; Dziopa, K.

2026-02-01 health informatics

10.64898/2026.01.26.26344438 medRxiv


Background: Cardiovascular diseases (CVDs) remain the leading global cause of morbidity and mortality. In clinical practice, 10-year risk prediction tools such as the Pooled Cohort Equations, QRISK3, and SCORE2 are widely used because of their transparency and clinical trustworthiness, but they rely heavily on biomarkers and medical history. As a result, most recommendations concentrate on pharmaceutical or procedural management. In many situations, crucial biomarkers are also unavailable, making it difficult to evaluate individual risk precisely and to select appropriate treatments.

Objective: To develop interpretable, lifestyle-based machine learning models for predicting 10-year risk of cardiovascular disease (including heart failure and atrial fibrillation) and, more critically, to systematically compare interpretability algorithms and assess the cross-model consistency of the identified behavioural factors.

Methods: Using UK Biobank data, logistic regression, random forest, and XGBoost models were trained on lifestyle variables (including sleep, smoking, diet, physical activity, and electronic device use) and demographic variables only. Discrimination and calibration were evaluated, and interpretability was assessed using permutation importance, SHapley Additive exPlanations (SHAP), and Local Interpretable Model-agnostic Explanations (LIME), with subgroup analyses by sex and age to characterise heterogeneity in model behaviour and feature relevance.

Results: The developed models demonstrated good discrimination, with XGBoost performing best (ROC-AUC 0.726 [95% CI 0.720-0.731]; PR-AUC 0.199), closely followed by logistic regression (ROC-AUC 0.721 [95% CI 0.716-0.726]; PR-AUC 0.192), while random forest showed slightly lower performance. Despite this similar performance, interpretability analyses revealed inconsistencies in the models' importance rankings of lifestyle factors. Age, sex, and smoking behaviours consistently emerged as key contributors across all interpretability methods, demonstrating strong cross-model agreement, while other lifestyle factors such as dietary patterns, physical activity, and sleep showed model-dependent variation in their assigned importance. Subgroup analyses further indicated that modifiable behaviours (smoking, diet, sleep) were particularly influential among younger females, whereas cumulative exposures and family history were more dominant drivers in older males.

Conclusions: Lifestyle-only interpretable models offer a scalable and low-cost framework for cardiovascular risk assessment and behaviour-focused prevention, without requiring laboratory measurements or clinical testing. By comparing multiple interpretability algorithms across models, this study shows strong cross-method consistency and highlights lifestyle factors whose importance profiles differ from those in traditional biomarker-based calculators. These models can complement existing risk tools by highlighting modifiable behaviours, which is particularly valuable for younger adults. They can also support personalised feedback in digital-health settings to promote behavioural change. Overall, the findings support the development of transparent, behaviour-focused tools that enable accessible and equitable cardiovascular prevention.
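
The abstract does not state how cross-model consistency of feature importance was quantified. One common way to express such agreement is a Spearman rank correlation between the importance scores two models assign to the same features. The sketch below is an illustration under that assumption, not the authors' method; the function names and the importance values are hypothetical placeholders, not results from the study.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from largest (rank 1) to smallest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (var_x * var_y) ** 0.5


def spearman_agreement(imp_a: Dict[str, float], imp_b: Dict[str, float]) -> float:
    """Spearman rank correlation between two feature-importance mappings."""
    if set(imp_a) != set(imp_b):
        raise ValueError("both models must score the same features")
    features = sorted(imp_a)
    ranks_a = average_ranks([imp_a[f] for f in features])
    ranks_b = average_ranks([imp_b[f] for f in features])
    return pearson(ranks_a, ranks_b)


# Hypothetical importance scores, made up purely for illustration.
xgb_importance = {"age": 0.30, "sex": 0.20, "smoking": 0.15,
                  "diet": 0.05, "sleep": 0.08, "physical_activity": 0.03}
logreg_importance = {"age": 0.28, "sex": 0.22, "smoking": 0.12,
                     "diet": 0.09, "sleep": 0.02, "physical_activity": 0.06}

print(f"rank agreement: {spearman_agreement(xgb_importance, logreg_importance):.3f}")
```

In this toy input the top three features are ranked identically by both models while the remaining lifestyle factors swap places, so the correlation is high but below 1, which mirrors the qualitative pattern the abstract describes.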
