
Protein Function Prediction via Contig-Aware Multi-Level Feature Integration

Yang, L.; Du, K.; Lu, Y.; Wang, M.; Zhang, H.; Yang, S.; Lin, Y.; Zhuo, J.; Zhang, D.; Jiang, Y.; Zhang, X.; Li, S.

2025-08-11 bioinformatics
10.1101/2025.08.07.669053 bioRxiv

Proteins play a central role in biological processes, and accurately predicting their functions is crucial for biomedical research. While computational methods have advanced significantly, most approaches rely solely on sequence or structure, neglecting critical inter-protein relationships such as the topological arrangement of coding sequences (CDSs) within contigs. To address this gap, we propose CAML, a novel deep learning model that integrates intra-protein features, including sequence and predicted structure, with inter-protein features capturing functional linkages among CDSs in contigs. Specifically, CAML employs a Graph Isomorphism Network (GIN) to extract structural features from predicted protein contact graphs and ESM-2 for sequence embeddings. Additionally, it leverages k-mer frequencies and a Bidirectional Long Short-Term Memory (BiLSTM) network to model functional relationships among colocalized CDSs within contigs, capturing operon-like associations. Extensive experiments demonstrate that CAML outperforms state-of-the-art methods in accuracy, precision, recall, and F1-score, achieving improvements of 11.24%, 12.43%, 13.59%, and 13.30%, respectively, over the second-best model. Ablation studies further confirm the critical contribution of CAML's multi-level biological feature integration in enhancing functional annotation accuracy. Our study demonstrates the importance of integrating structural, sequential, and CDS topological features for accurate protein function prediction, providing a robust computational framework for genomics research.
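As an illustration of the k-mer frequency features mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not state the alphabet, the value of k, or the normalization, so the function name `kmer_frequencies`, the nucleotide alphabet, and the default `k=3` are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return a fixed-length vector of normalized k-mer frequencies.

    Hypothetical helper: alphabet, k, and normalization are assumptions,
    not details taken from the CAML paper.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all possible k-mers in a fixed order so that vectors
    # computed from different sequences are directly comparable.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing characters outside the alphabet are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


vec = kmer_frequencies("ACGTACGT", k=2)
print(len(vec))  # 16 possible 2-mers over a 4-letter alphabet
```

In a pipeline of the kind the abstract outlines, one such vector per CDS, taken in contig order, could form the input sequence to a BiLSTM; how CAML actually combines these features is not specified here.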

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 2% | 17.1%
2. Briefings in Bioinformatics | 326 papers in training set | Top 0.2% | 14.0%
3. BMC Bioinformatics | 383 papers in training set | Top 1% | 8.9%
4. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.1% | 8.0%
5. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.9% | 4.7%

50% of probability mass above

6. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 1% | 4.7%
7. Advanced Science | 249 papers in training set | Top 4% | 4.2%
8. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.6% | 3.0%
9. PLOS Computational Biology | 1633 papers in training set | Top 12% | 2.8%
10. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.7%
11. Nature Communications | 4913 papers in training set | Top 47% | 2.0%
12. Bioinformatics Advances | 184 papers in training set | Top 3% | 1.8%
13. IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.2% | 1.7%
14. Nucleic Acids Research | 1128 papers in training set | Top 11% | 1.7%
15. Journal of Genetics and Genomics | 36 papers in training set | Top 1% | 1.3%
16. PLOS ONE | 4510 papers in training set | Top 61% | 1.2%
17. Scientific Reports | 3102 papers in training set | Top 72% | 0.9%
18. Frontiers in Genetics | 197 papers in training set | Top 8% | 0.9%
19. Genome Biology | 555 papers in training set | Top 7% | 0.8%
20. Database | 51 papers in training set | Top 0.9% | 0.8%
21. Genome Research | 409 papers in training set | Top 4% | 0.7%
22. Nature Machine Intelligence | 61 papers in training set | Top 4% | 0.6%
23. Plant Communications | 35 papers in training set | Top 2% | 0.6%
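
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The sketch below uses the first six probabilities from the ranking; the function name `journals_to_reach` is hypothetical and introduced only for this check.

```python
def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach the threshold, or None if it is never reached."""
    running = 0.0
    for count, p in enumerate(probabilities, start=1):
        running += p
        if running >= threshold:
            return count
    return None


# Predicted probabilities (in percent) for ranks 1 to 6, as listed above.
top_probs = [17.1, 14.0, 8.9, 8.0, 4.7, 4.7]
print(journals_to_reach(top_probs, 50.0))  # 5
```

The first four journals sum to 48.0%, so the fifth (4.7%) is the one that carries the running total past 50%, to 52.7%.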