
EXHEART: A Fairness-Aware Explainable Stacked Ensemble for Cardiovascular Disease Classification with Cross-Instrument Disparity Attribution

Biswas, M. A.; Laila, A.

2026-06-05 health informatics
10.64898/2026.06.03.26354879 medRxiv

Background: Machine learning models trained on population health surveys offer scalable tools for cardiovascular screening, but recurring methodological weaknesses undermine their credibility and equity: data leakage from synthetic oversampling, qualitative rather than quantitative explainability evaluation, and the absence of demographic fairness auditing at the clinical operating threshold.

Methods: We present EXHEART, a leakage-free stacked ensemble pipeline trained on BRFSS 2015 (n = 253,680) and validated on BRFSS 2020 (n = 319,795; temporal transport and retrain) and a clinical cardiovascular examination dataset (n = 68,730). The pipeline combines XGBoost, LightGBM, Random Forest, and a multi-layer perceptron as base learners with 5-fold out-of-fold logistic regression stacking and Platt scaling calibration. A quantitative SHAP-LIME consistency framework, based on Kendall-tau rank correlation and Jaccard overlap, accompanies a decision-curve analysis, a subgroup-stratified SHAP interaction analysis, and an intersectional fairness audit (Sex x Age x Income) with threshold-shifting mitigation and a frontier of the fairness-utility trade-off. The framework also adds cross-instrument fairness-disparity attribution, an empirical diagnostic that provides evidence on whether an observed subgroup disparity is more consistent with a measurement-induced or a substantive explanation by re-validating it on a dataset that measures the same clinical construct objectively. On heart disease, this diagnostic associates 89% of the sex TPR gap (95% CI [0.65, 0.99]) with the self-reported survey outcome rather than with a substantive risk difference.

Results: On BRFSS 2015, EXHEART achieves AUC-ROC = 0.850, AUPRC = 0.371, Brier score = 0.071, and reduces ECE by 96% (0.256 to 0.011) via Platt scaling. Global SHAP-LIME rank agreement is moderate-to-strong (Kendall-tau = 0.580, Spearman-rho = 0.818) with a substantial top-3 divergence (Jaccard@3 = 0.200), where Stroke flips from SHAP rank 8 to LIME rank 1. The Sex TPR gap is 0.124 at the screening threshold; intersectional Sex x Age disparities reach 0.649 among adequately-powered cells, 5.2x the single-attribute gap. Temporal transport to BRFSS 2020 collapses sensitivity from 0.776 to 0.267, while retraining restores AUC = 0.840 and ECE = 0.012. On clinical examination data, the Sex TPR gap collapses to 0.014; the attribution test indicates this gap is instrument-dependent, consistent with a measurement or outcome-definition explanation rather than a substantive risk difference. Cross-domain SHAP analysis identifies four instrument-independent CVD risk factors and two major portability failures.

Conclusions: EXHEART combines three practices that population-scale cardiovascular classifiers usually apply in isolation: leakage-free training with calibrated probabilities, a test of whether the model's explanations are stable, and a fairness audit that examines intersecting subgroups rather than single attributes. Bringing them together proved worthwhile. The intersectional audit revealed disparities that single-attribute auditing missed, and the cross-instrument comparison indicated that much of the sex gap reflects how the outcome is measured in survey data rather than a substantive difference in risk. The temporal transport findings indicate that deployed BRFSS models warrant periodic monitoring and retraining to maintain clinical utility. EXHEART is a retrospective methodological evaluation on public de-identified data; it is not validated for direct clinical decision-making, diagnosis, or treatment recommendation without prospective clinical validation.
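Several of the headline figures in the abstract are simple ratios of other quoted numbers, so they can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes that the 89% attribution figure can be read as the share of the survey sex TPR gap that does not reappear on the clinical dataset (the abstract does not give the exact estimator behind it or behind its confidence interval), and the feature names other than Stroke are hypothetical placeholders.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrinks when going from `before` to `after`."""
    return (before - after) / before


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a ∩ b| / |a ∪ b| of two sets."""
    return len(a & b) / len(a | b)


# Figures quoted in the abstract.
ece_raw, ece_platt = 0.256, 0.011                   # ECE before / after Platt scaling
sex_gap_survey, sex_gap_clinical = 0.124, 0.014     # Sex TPR gap, BRFSS vs clinical data
intersectional_gap = 0.649                          # largest Sex x Age TPR gap

ece_drop = relative_reduction(ece_raw, ece_platt)
gap_ratio = intersectional_gap / sex_gap_survey
# Assumption: share of the survey gap that is not reproduced on clinical data.
gap_not_reproduced = relative_reduction(sex_gap_survey, sex_gap_clinical)

# Hypothetical top-3 feature sets that share exactly one feature.
# Only "Stroke" (LIME rank 1, outside the SHAP top 3) comes from the abstract.
shap_top3 = {"feature_a", "feature_b", "feature_c"}
lime_top3 = {"Stroke", "feature_a", "feature_d"}
top3_overlap = jaccard(shap_top3, lime_top3)

print(f"ECE reduction:            {ece_drop:.1%}")
print(f"Intersectional / single:  {gap_ratio:.1f}x")
print(f"Sex gap not reproduced:   {gap_not_reproduced:.1%}")
print(f"Jaccard@3, one shared:    {top3_overlap:.3f}")
```

Under these assumptions the ratios round to the quoted 96%, 5.2x, and 89%, and a Jaccard@3 of 0.200 corresponds to the two top-3 lists sharing a single feature (1 shared out of 5 distinct).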

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine | 97 papers in training set | Top 0.3% | 17.0%
2. Journal of the American Heart Association | 119 papers in training set | Top 0.7% | 9.8%
3. European Heart Journal - Digital Health | 15 papers in training set | Top 0.1% | 7.0%
4. Journal of Biomedical Informatics | 45 papers in training set | Top 0.3% | 4.7%
5. PLOS Digital Health | 91 papers in training set | Top 0.6% | 4.2%
6. The Lancet Digital Health | 25 papers in training set | Top 0.1% | 4.2%
7. Scientific Reports | 3102 papers in training set | Top 39% | 3.5%
50% of probability mass above
8. BMC Medical Research Methodology | 43 papers in training set | Top 0.3% | 3.5%
9. Nature Communications | 4913 papers in training set | Top 42% | 3.5%
10. PLOS ONE | 4510 papers in training set | Top 42% | 3.0%
11. BMC Medicine | 163 papers in training set | Top 2% | 2.7%
12. Journal of the American College of Cardiology | 12 papers in training set | Top 0.2% | 2.7%
13. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.7%
14. Circulation | 66 papers in training set | Top 2% | 1.6%
15. Circulation: Genomic and Precision Medicine | 42 papers in training set | Top 0.8% | 1.6%
16. BMJ Health & Care Informatics | 13 papers in training set | Top 0.5% | 1.6%
17. JMIR Medical Informatics | 17 papers in training set | Top 0.9% | 1.4%
18. Annals of Internal Medicine | 27 papers in training set | Top 0.6% | 1.3%
19. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.4% | 1.3%
20. JAMIA Open | 37 papers in training set | Top 1% | 1.2%
21. Communications Medicine | 85 papers in training set | Top 0.5% | 1.2%
22. European Journal of Epidemiology | 40 papers in training set | Top 0.5% | 1.2%
23. PLOS Computational Biology | 1633 papers in training set | Top 21% | 1.1%
24. Patterns | 70 papers in training set | Top 2% | 1.1%
25. BMJ | 49 papers in training set | Top 1% | 0.9%
26. Nature Medicine | 117 papers in training set | Top 5% | 0.8%
27. JAMA Network Open | 127 papers in training set | Top 5% | 0.7%
28. JMIR Public Health and Surveillance | 45 papers in training set | Top 4% | 0.7%
29. Nature Human Behaviour | 85 papers in training set | Top 5% | 0.7%
30. iScience | 1063 papers in training set | Top 35% | 0.7%