
MsPBRsP: Multi-scale Protein Binding Residues Prediction Using Language Model

Li, Y.; Lu, S.; Nan, X.; Zhang, S.; Zhou, Q.

2023-02-27 bioinformatics
10.1101/2023.02.26.528265 bioRxiv

Accurate prediction of protein binding residues (PBRs) from sequence is important for understanding cellular activity and helpful for designing novel drugs. However, experimental methods are time-consuming and expensive. In recent years, many computational predictors based on machine learning and deep learning models have been proposed to reduce this cost. But those methods often use MSA tools such as PSI-BLAST or NetSurfP to generate statistical features, which are fed into predictive models as necessary supplementary input. The input generation process normally takes a long time, and there is no standard specifying which and how many statistical features should be provided to a prediction model. In addition, prediction of PBRs relies on the local context of each residue, but the most appropriate scale is undetermined. Most works pre-select certain residue features as input, together with a scale size, based on expertise for a certain type of PBRs. In this study, we propose Multi-scale Protein Binding Residues Prediction using language model (MsPBRsP), a general, tool-free, end-to-end framework that can be applied to all types of PBRs. We adopt the pre-trained language model ProtTrans to avoid the large cost incurred by MSA tools, and use the protein sequence alone as input to our model. To ease the uncertainty in scale size, we construct multi-size windows in the attention layer and multi-size kernels in the convolutional layer. We test our framework on various benchmark datasets including PBRs from protein-protein, protein-nucleotide, protein-small ligand, heterodimer, homodimer and antibody-antigen interactions. Compared with existing state-of-the-art methods, MsPBRsP achieves superior performance with less running time and higher prediction rates on every PBRs prediction task. Specifically, we boost the F1 score by 27.1% and the AUPRC score by 7.6% on the NSP448 dataset and decrease running time from over 10 minutes to under 0.1 s on average.
The source code and datasets are available at https://github.com/biolushuai/MsPBRsP-for-multiple-PBRs-prediction.
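
The multi-size kernel idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the kernel sizes (3, 5, 7), the feature dimensions, the ReLU activation, and the function names `conv1d_same` and `multi_scale_conv` are all assumptions chosen for illustration, and random vectors stand in for per-residue language-model embeddings. Each kernel size looks at a different amount of local context around a residue, and the resulting features are concatenated so that no single scale has to be chosen in advance.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1D convolution over the sequence axis with zero padding.

    x:      (L, d_in) per-residue features
    kernel: (k, d_in, d_out) weights, k odd
    Returns (L, d_out), one output vector per residue.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # pad only the sequence axis
    out = np.empty((x.shape[0], kernel.shape[2]))
    for i in range(x.shape[0]):
        window = xp[i:i + k]  # the k residues centred on position i
        out[i] = np.tensordot(window, kernel, axes=([0, 1], [0, 1]))
    return out

def multi_scale_conv(x, kernels):
    """Apply one convolution per kernel size, then ReLU and concatenate."""
    feats = [np.maximum(conv1d_same(x, w), 0.0) for w in kernels]
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
seq_len, d_in, d_out = 50, 16, 8          # hypothetical sizes
x = rng.normal(size=(seq_len, d_in))      # stand-in for residue embeddings
kernel_sizes = (3, 5, 7)                  # hypothetical scales
kernels = [rng.normal(size=(k, d_in, d_out)) * 0.1 for k in kernel_sizes]

features = multi_scale_conv(x, kernels)
print(features.shape)  # (50, 24): 3 scales x 8 output channels per residue
```

In a trained model the kernel weights would be learned and a per-residue classifier would follow; the sketch only shows how several context sizes can be combined for each residue.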

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics: 326 papers in training set, Top 0.1%, 27.5%
2. Bioinformatics: 1061 papers in training set, Top 1%, 22.3%
3. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 0.7%, 8.3%
50% of probability mass above
4. Nature Communications: 4913 papers in training set, Top 40%, 3.6%
5. BMC Bioinformatics: 383 papers in training set, Top 3%, 3.6%
6. Journal of Molecular Biology: 217 papers in training set, Top 1%, 2.1%
7. PLOS Computational Biology: 1633 papers in training set, Top 14%, 2.1%
8. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.2%, 1.9%
9. Journal of Cheminformatics: 25 papers in training set, Top 0.3%, 1.7%
10. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set, Top 0.2%, 1.7%
11. Nature Machine Intelligence: 61 papers in training set, Top 2%, 1.5%
12. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.3%
13. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 6%, 1.2%
14. Nucleic Acids Research: 1128 papers in training set, Top 14%, 1.2%
15. Scientific Reports: 3102 papers in training set, Top 69%, 0.9%
16. Quantitative Biology: 11 papers in training set, Top 0.6%, 0.9%
17. Journal of Chemical Information and Modeling: 207 papers in training set, Top 3%, 0.7%
18. National Science Review: 22 papers in training set, Top 2%, 0.7%
19. Journal of Genetics and Genomics: 36 papers in training set, Top 2%, 0.7%
20. Patterns: 70 papers in training set, Top 3%, 0.7%
21. PLOS ONE: 4510 papers in training set, Top 68%, 0.7%
22. Science Bulletin: 22 papers in training set, Top 1%, 0.6%
23. eLife: 5422 papers in training set, Top 62%, 0.6%
24. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.6%
25. Communications Biology: 886 papers in training set, Top 29%, 0.6%
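
The 50% cutoff in the ranking can be checked with a short cumulative sum over the listed probabilities. The sketch below is only an illustration of that arithmetic: the helper name `journals_to_reach` is hypothetical, and it assumes the cutoff is placed after the first journal at which the running total reaches 50%.

```python
# Predicted probabilities (in percent) of the five highest-ranked journals.
probs = [27.5, 22.3, 8.3, 3.6, 3.6]

def journals_to_reach(probs, threshold):
    """Return how many top entries are needed for the running total
    to reach the threshold, together with that running total."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total

count, total = journals_to_reach(probs, 50.0)
print(count)  # 3
```

The first two journals sum to 49.8%, just short of the threshold, so the third journal is needed and the running total at the cutoff is about 58.1%.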