
SHAP-Guided CpG Selection with Ensemble Learning for Epigenetic Age Prediction

Kaulagi, S.; Chavan, H.

2026-02-23 genomics
10.64898/2026.02.20.707142 bioRxiv

Epigenetic biomarkers offer critical insight into biological aging and disease risk, yet most deep learning models lack interpretability and generalization across tissues. We present a reproducible pipeline for interpretable age classification using SHAP-guided CpG prioritization, enhancer and gene annotation, and stacked ensemble modeling. Across both blood and brain samples (GSE41826, GSE40279), certain CpGs showed reproducible age-linked methylation changes. Comparative performance metrics, SHAP breakdowns, and CpG-level stability analyses support their potential as cross-tissue anchor sites. A multi-model ensemble combining XGBoost, MLP, TabTransformer→XGBoost, and LightGBM yielded high predictive accuracy (92.4%) and macro F1 of 92.3%. Biological support for these findings stems from motif scans, enrichment results, and visual mapping of CpG-to-gene relationships using Sankey diagrams. Delta-based stacking improved prediction confidence in borderline age groups, notably boosting middle-age recall through complementary model behavior. This work lays the groundwork for explainable epigenetic clocks that transcend tissue boundaries.
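
The abstract reports accuracy, macro F1, and middle-age recall for a multi-class age classifier. As a minimal sketch of how those three metrics relate, the following pure-Python example computes them on toy labels. The class names ("young", "middle", "old"), the toy predictions, and the helper function names are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in labels:
        prec_den = tp[label] + fp[label]
        rec_den = tp[label] + fn[label]
        precision = tp[label] / prec_den if prec_den else 0.0
        recall = tp[label] / rec_den if rec_den else 0.0
        f1_den = precision + recall
        f1 = 2 * precision * recall / f1_den if f1_den else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each age group counts equally."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels (not data from the study).
y_true = ["young", "young", "middle", "middle", "old", "old"]
y_pred = ["young", "middle", "middle", "middle", "old", "middle"]

scores = per_class_scores(y_true, y_pred)
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
print("middle-age recall:", scores["middle"][1])
```

Because macro F1 averages the per-class scores without weighting by class size, a gain in recall for a single borderline class such as middle age moves the macro score even when overall accuracy changes little.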

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Nature Aging | 51 papers in training set | Top 0.1% | 14.9%
2 | Nature Communications | 4913 papers in training set | Top 22% | 8.5%
3 | Aging Cell | 144 papers in training set | Top 0.8% | 8.5%
4 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.8% | 4.9%
5 | Aging | 69 papers in training set | Top 0.4% | 4.9%
6 | Frontiers in Genetics | 197 papers in training set | Top 1% | 4.9%
7 | PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.6%
50% of probability mass above
8 | Cell Reports | 1338 papers in training set | Top 14% | 3.6%
9 | Scientific Reports | 3102 papers in training set | Top 41% | 3.1%
10 | npj Aging | 15 papers in training set | Top 0.3% | 2.6%
11 | eLife | 5422 papers in training set | Top 33% | 2.4%
12 | Advanced Science | 249 papers in training set | Top 8% | 2.1%
13 | Genome Medicine | 154 papers in training set | Top 3% | 2.1%
14 | Communications Biology | 886 papers in training set | Top 6% | 1.9%
15 | PLOS ONE | 4510 papers in training set | Top 50% | 1.9%
16 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 32% | 1.7%
17 | Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.7%
18 | Clinical Epigenetics | 53 papers in training set | Top 0.6% | 1.3%
19 | Cell Genomics | 162 papers in training set | Top 4% | 1.3%
20 | Nature Medicine | 117 papers in training set | Top 3% | 1.2%
21 | GeroScience | 97 papers in training set | Top 1% | 1.1%
22 | Nature Machine Intelligence | 61 papers in training set | Top 3% | 1.1%
23 | Bioinformatics | 1061 papers in training set | Top 9% | 0.9%
24 | Science Advances | 1098 papers in training set | Top 26% | 0.9%
25 | Cell Reports Methods | 141 papers in training set | Top 4% | 0.8%
26 | Alzheimer's Research & Therapy | 52 papers in training set | Top 2% | 0.8%
27 | Bioinformatics Advances | 184 papers in training set | Top 5% | 0.8%
28 | Genetic Epidemiology | 46 papers in training set | Top 0.8% | 0.8%
29 | Nucleic Acids Research | 1128 papers in training set | Top 18% | 0.7%
30 | The Journals of Gerontology, Series A: Biological Sciences and Medical Sciences | 22 papers in training set | Top 0.4% | 0.7%
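
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below does that with the first ten percentages from the ranking above; the function name `journals_to_reach` is a hypothetical helper, and the displayed percentages are rounded, so the running total is approximate.

```python
def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running sum
    of probabilities to reach the threshold, or None if it never does."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


# Predicted probabilities (in percent) for ranks 1 to 10, as listed above.
top_probs = [14.9, 8.5, 8.5, 4.9, 4.9, 4.9, 3.6, 3.6, 3.1, 2.6]

n_needed = journals_to_reach(top_probs, 50.0)
print("journals needed for 50%:", n_needed)
print("mass of those journals:", round(sum(top_probs[:n_needed]), 1))
```

With these values the first six journals sum to about 46.6% and the seventh brings the total to about 50.2%, which agrees with the cutoff marked after rank 7.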