Heart Failure Prediction & Risk Stratification using Machine Learning

Ali, S.; Leavitt, M. A.; Asghar, W.

2026-04-05 public and global health
10.64898/2026.04.03.26350139 medRxiv

Heart failure (HF) is a leading cause of morbidity, mortality, and healthcare expenditure: approximately 6.7 million adults in the U.S. live with the condition, which contributes to hundreds of thousands of deaths annually. Early identification of high-risk individuals remains a challenge because HF symptoms are often ignored or misattributed to normal aging, stress, or minor illness, which delays diagnosis. We trained and evaluated several models for HF prediction, including logistic regression, SVM, KNN, random forest, XGBoost, MLP, and a custom stacked ensemble, using stratified 5-fold CV and 70/30 hold-out splits on routinely available electronic medical record (EMR) data from the All of Us Research Program. The cohort comprised 37,070 adults (13,577 HF; 23,493 non-HF). The predictors were readily available demographics, vital signs, conditions, and laboratory results. Preprocessing included IQR-based winsorization, median imputation, one-hot encoding, and a QuantileTransformer. The stacked model achieved a ROC-AUC of 0.927, a PR-AUC of 0.895, and an accuracy of 0.856 on the test set. To support real-world deployment, we calibrated the predicted probabilities and adjusted them to a realistic population prevalence, yielding interpretable probability estimates and clear stratification of individuals into clinically actionable risk tiers. SHAP analysis identified atrial fibrillation, age, hypertensive disorder, sodium, and deprivation index as the five features with the greatest impact on the model's predictions. A secondary multiclass experiment (No-HF, HF with reduced ejection fraction, and HF with preserved ejection fraction) achieved lower discrimination (macro-AUC ~0.87) and lower per-class precision and recall, presumably due to label noise, class imbalance, and overlapping phenotypes.
We have demonstrated that a carefully calibrated stacked ensemble built on readily available EMR variables can achieve strong discrimination for HF, making it an effective tool for an AI clinical decision support system (AI-CDSS) in population screening and proactive care pathways.
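
The abstract does not say how the calibrated probabilities were adjusted to population prevalence. One common approach is the standard prior-shift correction, sketched below under that assumption; it is not necessarily the authors' method. The training prevalence comes from the cohort counts in the abstract (13,577 of 37,070). The target prevalence `TARGET_PREV`, the tier cutoffs, and the function names `adjust_to_prevalence` and `risk_tier` are hypothetical values chosen for illustration only.

```python
def adjust_to_prevalence(p, train_prev, target_prev):
    """Rescale a calibrated probability from the training prevalence to a
    target prevalence using the prior-shift (odds re-weighting) correction."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not (0.0 < train_prev < 1.0 and 0.0 < target_prev < 1.0):
        raise ValueError("prevalences must lie strictly between 0 and 1")
    # Re-weight the positive and negative class mass by the ratio of priors.
    pos = p * target_prev / train_prev
    neg = (1.0 - p) * (1.0 - target_prev) / (1.0 - train_prev)
    return pos / (pos + neg)


def risk_tier(p, cutoffs=(0.05, 0.20)):
    """Map an adjusted probability to a tier. The cutoffs are hypothetical."""
    low, high = cutoffs
    if p < low:
        return "low"
    if p < high:
        return "moderate"
    return "high"


# Training prevalence from the cohort reported in the abstract.
TRAIN_PREV = 13577 / 37070
# Assumed population prevalence; not a figure from the abstract.
TARGET_PREV = 0.03

if __name__ == "__main__":
    for raw in (0.10, 0.50, 0.90):
        adjusted = adjust_to_prevalence(raw, TRAIN_PREV, TARGET_PREV)
        print(f"raw={raw:.2f} adjusted={adjusted:.3f} tier={risk_tier(adjusted)}")
```

The correction is monotone, so it changes the probabilities and the tier assignments but leaves the ranking of individuals, and therefore ROC-AUC, unchanged. A prediction equal to the training base rate maps exactly to the target base rate.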

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 97 papers in training set, Top 0.1%, 25.7%
2. Nature Medicine: 117 papers in training set, Top 0.1%, 10.0%
3. European Heart Journal - Digital Health: 15 papers in training set, Top 0.1%, 8.1%
4. Scientific Reports: 3102 papers in training set, Top 12%, 7.1%

50% of probability mass above

5. PLOS ONE: 4510 papers in training set, Top 29%, 6.3%
6. Nature Communications: 4913 papers in training set, Top 37%, 3.9%
7. Journal of Biomedical Informatics: 45 papers in training set, Top 0.5%, 3.6%
8. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.8%, 3.6%
9. Genome Medicine: 154 papers in training set, Top 4%, 2.1%
10. PLOS Digital Health: 91 papers in training set, Top 1%, 2.1%
11. Frontiers in Physiology: 93 papers in training set, Top 2%, 1.9%
12. JMIR Medical Informatics: 17 papers in training set, Top 0.7%, 1.8%
13. Journal of Medical Internet Research: 85 papers in training set, Top 3%, 1.7%
14. Circulation: 66 papers in training set, Top 2%, 1.6%
15. Frontiers in Cardiovascular Medicine: 49 papers in training set, Top 2%, 1.3%
16. Frontiers in Public Health: 140 papers in training set, Top 6%, 1.3%
17. eLife: 5422 papers in training set, Top 47%, 1.3%
18. Communications Biology: 886 papers in training set, Top 15%, 1.2%
19. Communications Medicine: 85 papers in training set, Top 0.8%, 0.9%
20. eBioMedicine: 130 papers in training set, Top 5%, 0.7%
21. Patterns: 70 papers in training set, Top 3%, 0.7%
22. PLOS Computational Biology: 1633 papers in training set, Top 27%, 0.6%
23. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 3%, 0.6%