A Machine Learning Approach for Physiological Role Prediction in Protein Contact Networks: a large-scale analysis on the human proteome

Cervellini, M.; Martino, A.

2026-04-14 bioinformatics
10.64898/2026.04.10.717657 bioRxiv
Proteins are fundamental macromolecules involved in virtually all biological processes. Their physiological roles are tightly linked to their three-dimensional structure, which can be naturally abstracted as Protein Contact Networks (PCNs), i.e., graphs where residues are nodes and edges encode spatial proximity. This representation enables the application of Graph Machine Learning to address the protein functional annotation gap at proteome scale. In this work, protein function prediction is studied on the majority of the human proteome, focusing on enzymatic activity and enzyme class assignment as well-defined and biologically meaningful targets. A large-scale supervised analysis was conducted on PCNs derived from experimentally resolved human protein structures. Multiple graph-based learning paradigms were systematically compared under a unified evaluation protocol, including handcrafted graph embeddings, kernel methods, and end-to-end Graph Neural Networks (GNNs). Feature engineering approaches comprised (i) spectral density embeddings of the normalized graph Laplacian and (ii) higher-order topological representations based on simplicial complexes, with optional INDVAL-based feature selection. These representations were paired with linear, ensemble, and kernel classifiers, while GNNs were trained directly on raw PCNs exploiting a diverse set of message-passing architectures. Two tasks were considered: binary classification of enzymatic versus non-enzymatic proteins and multiclass prediction of first-level Enzyme Commission (EC) classes. Performance was assessed using repeated stratified splits to ensure robust and variance-aware evaluation. In the binary enzymatic classification task, the Jaccard-based graph kernel achieved the best performance with an adjusted balanced accuracy of 0.90, closely followed by GNNs trained end-to-end on PCNs. 
In the multiclass EC prediction task, GNNs demonstrated superior discriminative power, reaching an adjusted balanced accuracy of 0.92 and outperforming all explicit embedding and kernel-based approaches. Overall, results indicate that EC class prediction is intrinsically more complex than binary enzymatic discrimination and benefits from the higher expressivity of deep message-passing architectures. The findings demonstrate that graph-based representations of protein structure support competitive functional prediction at proteome scale, with classical kernel methods and modern GNNs offering complementary strengths in terms of accuracy, scalability, and flexibility.
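As a rough illustration of the first feature-engineering route named in the abstract, the sketch below builds a contact network from residue coordinates and summarizes the spectrum of its normalized graph Laplacian as a fixed-length vector. It is not the authors' pipeline: the 8 Å cutoff, the use of a plain histogram as the density estimate, the bin count, and the function names `contact_network`, `normalized_laplacian_spectrum`, and `spectral_density_embedding` are all assumptions made for the example.

```python
import numpy as np

def contact_network(coords, cutoff=8.0):
    """Adjacency matrix of a contact network.

    Residues i and j are linked when their coordinates lie within
    `cutoff` angstroms. The cutoff value is an assumption for this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist <= cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def normalized_laplacian_spectrum(adj):
    """Eigenvalues of L = I - D^{-1/2} A D^{-1/2}, in ascending order."""
    deg = adj.sum(axis=1)
    connected = deg > 0
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])
    # Isolated nodes get a zero row/column, a common convention.
    lap = np.diag(connected.astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    return np.linalg.eigvalsh(lap)

def spectral_density_embedding(adj, bins=20):
    """Histogram of the normalized-Laplacian spectrum over [0, 2].

    The spectrum always lies in [0, 2], so graphs of different sizes
    map to vectors of the same length. A histogram stands in for the
    density estimate here (an assumption of this sketch).
    """
    eig = np.clip(normalized_laplacian_spectrum(adj), 0.0, 2.0)
    hist, _ = np.histogram(eig, bins=bins, range=(0.0, 2.0))
    return hist / hist.sum()

# Toy usage: four pseudo-residues spaced 5 angstroms apart on a line.
toy_coords = [[0, 0, 0], [5, 0, 0], [10, 0, 0], [15, 0, 0]]
toy_embedding = spectral_density_embedding(contact_network(toy_coords))
```

Because every graph is mapped to a vector of the same length regardless of its size, such embeddings can be passed to the linear, ensemble, or kernel classifiers mentioned above without any graph alignment step.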

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Advanced Science: 249 papers in training set, Top 0.3%, 18.6%
2. Nature Communications: 4913 papers in training set, Top 28%, 6.4%
3. Scientific Reports: 3102 papers in training set, Top 24%, 4.8%
4. Bioinformatics: 1061 papers in training set, Top 4%, 4.8%
5. Briefings in Bioinformatics: 326 papers in training set, Top 1%, 4.8%
6. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 20%, 3.6%
7. Journal of Proteome Research: 215 papers in training set, Top 0.8%, 3.6%
8. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 2%, 3.6%

50% of probability mass above

9. Journal of Chemical Information and Modeling: 207 papers in training set, Top 1%, 3.1%
10. NAR Genomics and Bioinformatics: 214 papers in training set, Top 1%, 2.6%
11. Patterns: 70 papers in training set, Top 0.4%, 2.6%
12. PLOS Computational Biology: 1633 papers in training set, Top 12%, 2.6%
13. Molecular & Cellular Proteomics: 158 papers in training set, Top 0.8%, 2.4%
14. Nucleic Acids Research: 1128 papers in training set, Top 9%, 2.1%
15. Nature Machine Intelligence: 61 papers in training set, Top 2%, 1.9%
16. BMC Bioinformatics: 383 papers in training set, Top 4%, 1.9%
17. Journal of Molecular Biology: 217 papers in training set, Top 1%, 1.8%
18. Communications Biology: 886 papers in training set, Top 9%, 1.7%
19. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.7%
20. Cell Systems: 167 papers in training set, Top 9%, 1.2%
21. eLife: 5422 papers in training set, Top 49%, 1.2%
22. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 1.1%
23. Protein Science: 221 papers in training set, Top 1%, 0.9%
24. Frontiers in Genetics: 197 papers in training set, Top 8%, 0.9%
25. International Journal of Molecular Sciences: 453 papers in training set, Top 16%, 0.7%
26. Frontiers in Immunology: 586 papers in training set, Top 8%, 0.7%
27. Nature Methods: 336 papers in training set, Top 6%, 0.7%
28. iScience: 1063 papers in training set, Top 35%, 0.7%
29. Biophysical Journal: 545 papers in training set, Top 6%, 0.6%
30. GigaScience: 172 papers in training set, Top 4%, 0.6%
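
The 50% cutoff in the journal list can be checked with a short cumulative sum. The sketch below uses the rounded percentages displayed for the ten highest-ranked journals and assumes the cutoff is the smallest rank at which the running total reaches 50%; the page does not state how the threshold is computed, and `ranks_to_reach` is a hypothetical helper name.

```python
from itertools import accumulate

# Displayed (rounded) probabilities, in percent, for ranks 1 to 10.
top_probabilities = [18.6, 6.4, 4.8, 4.8, 4.8, 3.6, 3.6, 3.6, 3.1, 2.6]

def ranks_to_reach(probabilities, threshold=50.0):
    """Return (rank, running_total) at the first rank whose cumulative
    probability reaches `threshold`, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return rank, total
    return None

rank, total = ranks_to_reach(top_probabilities)
print(rank, round(total, 1))
```

With the displayed values, the first seven journals sum to 46.6% and the first eight to 50.2%, which agrees with the cutoff placed after rank 8.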