
Prioritizing DNA methylation biomarkers using graph neural networks and explainable AI

Kumar, A.; Do, T. A.; Gruening, B.; Becker, H.; Backofen, R.

2026-01-27 bioinformatics
10.64898/2026.01.26.701692 bioRxiv

DNA methylation is a significant epigenetic modification involving the addition of a methyl group to the C5 position of cytosine residues. The modification contributes to disease progression, immune response, and outcomes in diseases such as breast cancer (BC) and acute myeloid leukemia (AML). Illumina's HumanMethylation450 BeadChip (450K) and EPIC BeadChip (850K) methylation arrays are heavily used in such cancer studies to determine differentially expressed and differentially methylated genomic regions. Many of these are biomarkers used effectively for exploring therapeutic targets. Several studies report a few potential biomarkers, but the enormous number of largely unexplored probe-level (CpG site) methylation signals may contain additional significant biomarkers. To prioritise the under-explored, disease-specific CpG sites from DNA methylation arrays and potentially uncover novel biomarkers, we present GraphMeX-plain, a novel graph neural network (GNN)-based approach with an explainable AI module. The underlying graph neural network is a principal neighbourhood aggregation (PNA) network. The approach uses the biomarkers reported in recent studies to rank candidate biomarkers from the unexplored set. A similarity graph between CpG sites (known and unexplored sets) is constructed using DNA methylation β values from the arrays, producing an interaction graph. Biomarkers from recent studies serve as seeds; from the unexplored CpG sites (excluding the seeds), highly variable ones that differ significantly between conditions (BC patients versus normal controls for the breast cancer arrays) are selected.
Using the combination of seed and highly variable CpG sites, a positive-unlabeled approach, network-informed adaptive positive-unlabeled learning (NIAPU), assigns soft labels to the unknown CpG sites: likely positive, weakly negative, likely negative, and reliable negative, in descending order of the likelihood that a CpG site is a potential biomarker. The graph neural network, a multi-layer PNA, refines the soft label assignments and achieves a weighted F1 classification score of 0.93 for BC and 0.91 for AML. The most likely set of CpG sites, those classified as "likely positive", is further explored using GNNExplainer, an explainable AI approach. Subgraphs for likely positive CpG sites predicted with high probability are computed, and their proximity to the original seed CpG sites is analysed. The CpG sites predicted as likely positive interact closely with the seeds. The top likely positive CpG site for BC is cg13265740 (C6orf115), where the gene C6orf115 is strongly associated with BC. For AML, the top likely positive CpG site is cg23281527 (KLHDC7A), where the gene KLHDC7A plays a strong role in the mechanism of AML. A high percentage of the likely positive CpG sites for both BC and AML, which remained unseen by the GNN model during training, are highly relevant to the respective diseases and can serve as potential therapeutic targets and prognostic markers.
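
The first two steps described in the abstract (selecting highly variable CpG sites outside the seed set, then linking CpG sites by the similarity of their β values) can be illustrated with a minimal sketch. The abstract does not state which similarity measure, variability criterion, or thresholds the authors use, so everything specific here is an assumption: Pearson correlation stands in for the similarity measure, an absolute difference in mean β between conditions stands in for "vary significantly", the cut-offs are arbitrary, and the CpG identifiers and β values are invented toy data.

```python
import math
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def highly_variable(beta, cases, controls, seeds, min_delta):
    """Non-seed CpGs whose mean beta differs between conditions by >= min_delta.

    Assumption: a mean-difference cut-off, used here only for illustration.
    """
    selected = []
    for cpg, values in beta.items():
        if cpg in seeds:
            continue
        delta = abs(mean([values[i] for i in cases]) -
                    mean([values[i] for i in controls]))
        if delta >= min_delta:
            selected.append(cpg)
    return sorted(selected)

def similarity_graph(beta, nodes, threshold):
    """Edges (u, v, r) between CpGs whose beta profiles correlate with r >= threshold."""
    edges = []
    for u, v in combinations(sorted(nodes), 2):
        r = pearson(beta[u], beta[v])
        if r >= threshold:
            edges.append((u, v, r))
    return edges

# Toy beta values: columns 0-2 are patients, columns 3-5 are controls.
# All identifiers below are hypothetical.
beta = {
    "cg_seed1": [0.80, 0.85, 0.90, 0.20, 0.25, 0.30],  # known biomarker (seed)
    "cg_u1":    [0.70, 0.75, 0.80, 0.10, 0.15, 0.20],  # tracks the seed
    "cg_u2":    [0.50, 0.52, 0.49, 0.51, 0.50, 0.48],  # flat across conditions
    "cg_u3":    [0.10, 0.20, 0.15, 0.80, 0.70, 0.75],  # opposite direction
}
seeds = {"cg_seed1"}

hv = highly_variable(beta, cases=[0, 1, 2], controls=[3, 4, 5],
                     seeds=seeds, min_delta=0.2)
edges = similarity_graph(beta, seeds | set(hv), threshold=0.9)

print("highly variable:", hv)
for u, v, r in edges:
    print(f"{u} -- {v} (r = {r:.2f})")
```

In this toy setting the flat site is filtered out, and only the unexplored site that tracks the seed ends up connected to it. A graph of this kind is what a positive-unlabeled labelling step and a GNN would then operate on; those later stages are not sketched here.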

Matching journals

The top 5 journals account for 50% of the predicted probability mass.
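
As a quick check of that statement, the sketch below accumulates the predicted probabilities listed on this page for the six highest-ranked journals and reports how many are needed to reach 50%. The function name is illustrative and not part of the page; only the numbers are taken from the list.

```python
# Predicted probabilities (percent) for the six highest-ranked journals,
# copied from the list on this page.
top_probabilities = [29.0, 7.5, 5.1, 5.1, 4.6, 3.8]

def journals_to_reach(probabilities, target):
    """Return (count, cumulative) for the shortest prefix whose sum reaches target."""
    cumulative = 0.0
    for count, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= target:
            return count, cumulative
    return None, cumulative

count, cumulative = journals_to_reach(top_probabilities, 50.0)
print(count, round(cumulative, 1))
```

With these values the threshold is crossed at the fifth journal, at a cumulative 51.3%, so the "50%" in the statement is a rounded figure.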

1. Bioinformatics; 1061 papers in training set; Top 0.7%; 29.0%
2. BMC Bioinformatics; 383 papers in training set; Top 1%; 7.5%
3. Frontiers in Genetics; 197 papers in training set; Top 1%; 5.1%
4. Briefings in Bioinformatics; 326 papers in training set; Top 1%; 5.1%
5. Bioinformatics Advances; 184 papers in training set; Top 0.8%; 4.6%
50% of probability mass above
6. Scientific Reports; 3102 papers in training set; Top 33%; 3.8%
7. PLOS Computational Biology; 1633 papers in training set; Top 9%; 3.8%
8. PLOS ONE; 4510 papers in training set; Top 43%; 3.0%
9. IEEE Journal of Biomedical and Health Informatics; 34 papers in training set; Top 0.6%; 2.9%
10. Computers in Biology and Medicine; 120 papers in training set; Top 1%; 2.9%
11. IEEE Transactions on Computational Biology and Bioinformatics; 17 papers in training set; Top 0.1%; 2.2%
12. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 3%; 2.0%
13. npj Systems Biology and Applications; 99 papers in training set; Top 1.0%; 1.8%
14. iScience; 1063 papers in training set; Top 13%; 1.8%
15. International Journal of Molecular Sciences; 453 papers in training set; Top 7%; 1.7%
16. IEEE/ACM Transactions on Computational Biology and Bioinformatics; 32 papers in training set; Top 0.4%; 1.0%
17. Advanced Science; 249 papers in training set; Top 16%; 0.9%
18. Genome Medicine; 154 papers in training set; Top 7%; 0.8%
19. Medical Image Analysis; 33 papers in training set; Top 1.0%; 0.8%
20. Frontiers in Molecular Biosciences; 100 papers in training set; Top 4%; 0.8%
21. Genes; 126 papers in training set; Top 3%; 0.8%
22. Communications Biology; 886 papers in training set; Top 23%; 0.8%
23. Nucleic Acids Research; 1128 papers in training set; Top 19%; 0.7%
24. GigaScience; 172 papers in training set; Top 3%; 0.7%
25. BioData Mining; 15 papers in training set; Top 1%; 0.7%
26. Journal of Computational Biology; 37 papers in training set; Top 0.7%; 0.7%
27. BMC Medical Genomics; 36 papers in training set; Top 2%; 0.5%
28. Frontiers in Cell and Developmental Biology; 218 papers in training set; Top 11%; 0.5%
29. The American Journal of Human Genetics; 206 papers in training set; Top 5%; 0.5%
30. Journal of Biomedical Informatics; 45 papers in training set; Top 2%; 0.5%