
Uncertainty-aware graph representation learning with positive-unlabeled classification for biomarker discovery in peripheral artery disease

Ayyalasomayajula, V. S. R. K.; Senders, M. L.; Wolterink, J. M.; Yeung, K. K.

2026-05-13 systems biology
10.64898/2026.05.08.723757 bioRxiv

Peripheral artery disease (PAD) is a complex vascular disorder characterized by heterogeneous molecular mechanisms and incomplete functional annotation, limiting systematic biomarker discovery. Network-based learning approaches provide a powerful framework for disease gene prioritization; however, most existing methods produce overconfident predictions without explicitly accounting for model uncertainty or structural novelty. Here, we present an uncertainty-aware framework for PAD biomarker discovery that integrates unsupervised graph representation learning, positive-unlabeled (PU) classification, ensemble prediction, and mechanistic explainability. Node embeddings were learned using multiple unsupervised graph neural network (GNN) objectives and combined with heterogeneous classifiers to generate ensemble-averaged probability estimates and epistemic uncertainty. By jointly modeling predictive confidence and embedding-space novelty, we stratified candidates into high-confidence rediscoveries and structurally novel hypotheses under explicit uncertainty control. Across eight embedding objectives and five classifiers, ensemble aggregation produced stable, well-calibrated predictions and enabled prioritization of 100 candidate PAD-associated proteins. Probability-heavy candidates clustered tightly with known PAD proteins and were enriched for established vascular and hemostatic pathways, including extracellular matrix organization, integrin signaling, coagulation, and fibrinolysis. In contrast, novelty-heavy candidates occupied distinct embedding-space regions and partitioned into multiple coherent clusters enriched for upstream regulatory and signaling processes, including G protein-coupled receptor, ephrin receptor, kinase-driven, and NF-κB-associated pathways. Five-fold cross-validated comparison with established PU learning baselines demonstrated consistent improvement across all evaluation metrics (AUC 0.916 ± 0.019 vs. 0.821 ± 0.030 for the best baseline), and external validity was confirmed by significant enrichment of top candidates for related cardiovascular disease annotations (5.7-fold above background). Together, these results demonstrate that integrating uncertainty, novelty, and explainability enables calibrated and biologically grounded biomarker prioritization, with broad applicability to PAD and other complex diseases.

Author summary

Peripheral artery disease affects millions of people worldwide but remains underdiagnosed, partly because we lack reliable molecular markers to detect it early. In this study, we developed a computational framework that uses protein interaction network data to predict which proteins may be involved in PAD, even when we only know a small number of confirmed disease-associated proteins. Our approach combines graph neural network embeddings with a machine learning technique called positive-unlabeled learning, which is specifically designed for situations where you have confirmed positives but no confirmed negatives. We also quantify how confident the model is in each prediction and identify candidates that are genuinely novel compared to what is already known. Tested against established methods, our framework consistently found more known disease proteins in cross-validated evaluation. The candidates we identified map to biologically coherent pathways relevant to vascular disease, and our top predictions are enriched for proteins associated with related cardiovascular conditions, providing external validation. This work provides a principled and transparent approach to biomarker discovery that could be applied to other complex diseases with limited molecular annotations.
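
The abstract describes three per-candidate quantities: an ensemble-averaged probability, an epistemic uncertainty, and an embedding-space novelty. The sketch below shows one plausible way to compute and combine them; it is not the authors' implementation. It assumes that uncertainty is the standard deviation of probabilities across ensemble members, that novelty is the Euclidean distance to the nearest known positive in embedding space, and that candidates are stratified by fixed cutoffs. The function names, cutoff values, and toy data are all hypothetical.

```python
import numpy as np

def ensemble_scores(prob_matrix):
    """Mean probability and spread across ensemble members.

    prob_matrix has shape (n_models, n_candidates); each row holds one
    model's predicted probabilities. The standard deviation across models
    is used here as a simple proxy for epistemic uncertainty (assumption).
    """
    return prob_matrix.mean(axis=0), prob_matrix.std(axis=0)

def novelty_scores(candidate_emb, positive_emb):
    """Euclidean distance from each candidate to its nearest known positive
    (an assumed novelty measure, not taken from the paper)."""
    diffs = candidate_emb[:, None, :] - positive_emb[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

def stratify(mean_prob, uncertainty, novelty,
             prob_cut=0.8, min_prob=0.5, unc_cut=0.1, nov_cut=1.0):
    """Assign a hypothetical label to each candidate using fixed cutoffs."""
    labels = []
    for p, u, n in zip(mean_prob, uncertainty, novelty):
        if u > unc_cut:
            labels.append("uncertain")       # models disagree too much
        elif p >= prob_cut and n < nov_cut:
            labels.append("rediscovery")     # confident and close to known positives
        elif p >= min_prob and n >= nov_cut:
            labels.append("novel")           # plausible but far from known positives
        else:
            labels.append("unranked")
    return labels

# Toy data: 3 ensemble members, 3 candidates, 2-dimensional embeddings.
probs = np.array([[0.90, 0.70, 0.20],
                  [0.95, 0.65, 0.80],
                  [0.85, 0.75, 0.50]])
candidate_emb = np.array([[0.1, 0.0], [3.0, 4.0], [0.0, 0.2]])
positive_emb = np.array([[0.0, 0.0], [1.0, 0.0]])

mean_prob, uncertainty = ensemble_scores(probs)
novelty = novelty_scores(candidate_emb, positive_emb)
labels = stratify(mean_prob, uncertainty, novelty)
print(labels)
```

In this toy example the first candidate is confident and close to a known positive, the second is moderately confident but far from all positives, and the third has high disagreement across models, so the three land in different strata.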

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1 | Bioinformatics Advances | 184 papers in training set | Top 0.1% | 14.3%
2 | Circulation: Genomic and Precision Medicine | 42 papers in training set | Top 0.1% | 12.4%
3 | PLOS Computational Biology | 1633 papers in training set | Top 3% | 10.4%
4 | Bioinformatics | 1061 papers in training set | Top 3% | 10.4%
5 | npj Digital Medicine | 97 papers in training set | Top 0.7% | 6.8%
50% of probability mass above
6 | Scientific Reports | 3102 papers in training set | Top 19% | 6.3%
7 | Patterns | 70 papers in training set | Top 0.1% | 4.8%
8 | Frontiers in Genetics | 197 papers in training set | Top 2% | 3.6%
9 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 3% | 2.4%
10 | Journal of Biomedical Informatics | 45 papers in training set | Top 0.7% | 2.1%
11 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 30% | 1.9%
12 | npj Systems Biology and Applications | 99 papers in training set | Top 1% | 1.7%
13 | iScience | 1063 papers in training set | Top 16% | 1.7%
14 | PLOS ONE | 4510 papers in training set | Top 55% | 1.7%
15 | BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
16 | BMC Medical Genomics | 36 papers in training set | Top 0.6% | 1.3%
17 | Communications Biology | 886 papers in training set | Top 14% | 1.2%
18 | Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.9%
19 | Communications Medicine | 85 papers in training set | Top 0.8% | 0.9%
20 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.5% | 0.8%
21 | Nature Communications | 4913 papers in training set | Top 66% | 0.6%
22 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 3% | 0.6%
23 | Cell Reports Medicine | 140 papers in training set | Top 10% | 0.6%
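
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is illustrative only: the function name is hypothetical, and it assumes the cutoff is the smallest number of top-ranked journals whose cumulative probability reaches the target.

```python
def journals_covering_mass(probabilities, target=50.0):
    """Return (count, cumulative) for the smallest set of top-ranked
    entries whose summed probability (in percent) reaches the target."""
    total = 0.0
    for count, p in enumerate(sorted(probabilities, reverse=True), start=1):
        total += p
        if total >= target:
            return count, total
    return len(probabilities), total

# Percentages of the seven highest-ranked journals from the list above.
listed = [14.3, 12.4, 10.4, 10.4, 6.8, 6.3, 4.8]
count, mass = journals_covering_mass(listed)
print(count, round(mass, 1))
```

With these values the first four journals sum to 47.5%, so the fifth is the one that carries the cumulative total past the 50% mark.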