
Protein Function Prediction with Pretrained ProtT5 Embeddings and Gradient Boosting

Appel, J.; Butcher, N.

2026-04-28 bioinformatics
10.64898/2026.04.27.721184 bioRxiv

Protein function prediction remains a central challenge in computational biology due to the extreme sparsity and long-tail distribution of Gene Ontology (GO) [1] annotations. Advances in protein language models enable the extraction of dense, fixed-length representations from amino acid sequences, offering a scalable alternative to hand-picked features such as physicochemical properties. In this work, we evaluate a transformer-based embedding approach using ProtT5-XL combined with classical and modern multi-label classifiers for Gene Ontology prediction in the CAFA-6 setting. Fixed-length embeddings were generated via mean pooling of transformer hidden states and used as input to one-vs-rest logistic regression, gradient-boosted decision trees, and a neural network. Models were evaluated on held-out validation data with a focus on threshold selection, prediction sparsity, and behavior across frequent and rare GO terms. Gradient boosting consistently provided the best balance between predictive performance and stable prediction behavior, motivating its use for ontology-specific predictors across molecular function, biological process, and cellular component annotations. This study highlights practical modeling choices for large-scale protein function prediction using pretrained sequence embeddings and provides an interpretable baseline for future CAFA evaluations.
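Two steps named in the abstract, mean pooling of hidden states and threshold selection, can be illustrated with a minimal sketch. This is not the authors' code. The function names `mean_pool` and `best_threshold`, the use of an attention mask to skip padding, and the choice of F1 as the selection criterion are all assumptions made for illustration; the abstract does not state which criterion was used.

```python
from typing import List, Sequence, Tuple


def mean_pool(hidden_states: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average per-residue vectors over positions where the mask is 1.

    hidden_states: one vector per token (length L, each of dimension D).
    attention_mask: 1 for real residues, 0 for padding (assumed convention).
    Returns a single fixed-length vector of dimension D.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


def best_threshold(scores: Sequence[float],
                   labels: Sequence[int],
                   candidates: Sequence[float]) -> Tuple[float, float]:
    """Pick the candidate threshold with the highest F1 for one GO term.

    F1 is an assumed criterion here, chosen only to make the sketch concrete.
    Ties keep the first (lowest-index) candidate.
    """
    best = (candidates[0], -1.0)
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best


# Toy example: three token vectors, the last one is padding.
embedding = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])

# Toy validation scores for a single GO term.
threshold, f1 = best_threshold([0.9, 0.6, 0.4, 0.1], [1, 1, 0, 0], [0.2, 0.5, 0.8])
```

In practice the per-residue vectors would come from the ProtT5-XL encoder, and the pooled vector would be fed to the downstream classifier; a per-term threshold sweep like this tends to be unstable for rare GO terms with few positive examples, which is one reason threshold selection deserves attention in this setting.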

Matching journals

The top 3 journals together account for more than 50% of the predicted probability mass (57.7% combined).

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | Bioinformatics | 1061 | Top 0.7% | 31.7% |
| 2 | BMC Bioinformatics | 383 | Top 0.3% | 17.9% |
| 3 | Bioinformatics Advances | 184 | Top 0.3% | 8.1% |
| | (50% of probability mass above) | | | |
| 4 | NAR Genomics and Bioinformatics | 214 | Top 0.1% | 8.1% |
| 5 | Briefings in Bioinformatics | 326 | Top 2% | 3.5% |
| 6 | PLOS Computational Biology | 1633 | Top 11% | 3.5% |
| 7 | Nucleic Acids Research | 1128 | Top 8% | 2.3% |
| 8 | Scientific Reports | 3102 | Top 54% | 1.8% |
| 9 | Cell Systems | 167 | Top 8% | 1.6% |
| 10 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.6% |
| 11 | Nature Machine Intelligence | 61 | Top 3% | 1.2% |
| 12 | Nature Communications | 4913 | Top 58% | 1.1% |
| 13 | Nature Methods | 336 | Top 6% | 0.9% |
| 14 | Proceedings of the National Academy of Sciences | 2130 | Top 41% | 0.9% |
| 15 | Frontiers in Genetics | 197 | Top 8% | 0.9% |
| 16 | Journal of Chemical Information and Modeling | 207 | Top 3% | 0.7% |
| 17 | GigaScience | 172 | Top 3% | 0.7% |
| 18 | npj Systems Biology and Applications | 99 | Top 3% | 0.7% |
| 19 | Advanced Science | 249 | Top 20% | 0.7% |
| 20 | Genome Medicine | 154 | Top 9% | 0.7% |
| 21 | International Journal of Molecular Sciences | 453 | Top 17% | 0.7% |
| 22 | Patterns | 70 | Top 3% | 0.6% |
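
The 50% cutoff in the journal list can be checked by accumulating the listed probabilities in rank order until the running total reaches the target. The sketch below uses the first four probabilities from the list; the helper name `journals_to_reach` is hypothetical and is not part of the page's own tooling.

```python
from typing import Optional, Sequence, Tuple


def journals_to_reach(probabilities: Sequence[float],
                      target: float) -> Optional[Tuple[int, float]]:
    """Return (rank, cumulative mass) at the first rank where the
    running total of probabilities reaches the target, or None if
    the target is never reached."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= target:
            return rank, total
    return None


# Probabilities (in percent) for ranks 1-4, copied from the list.
top_probabilities = [31.7, 17.9, 8.1, 8.1]
result = journals_to_reach(top_probabilities, 50.0)
rank, mass = result
```

The first two journals sum to 49.6%, just short of the target, so the third journal is needed to cross 50%, which matches the position of the cutoff in the list.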