USPNet: unbiased organism-agnostic signal peptide predictor with deep protein language model

Chen, S.; Tan, Q.; Li, J.; Li, Y.

2021-11-05 bioinformatics
10.1101/2021.11.04.467361 bioRxiv

A signal peptide is a short peptide located at the N-terminus of a protein. It plays an important role in targeting and transferring transmembrane and secreted proteins to their correct positions. Compared with traditional experimental methods for identifying signal peptides, computational methods are faster and more efficient, which makes them more practical for analyzing thousands or even millions of protein sequences, especially in metagenomic data. Computational tools have therefore recently been proposed to classify signal peptides and predict cleavage site positions, but most of them disregard the extreme data imbalance in these tasks. In addition, almost all of these methods rely on additional group information about the proteins to boost performance, and that information may not always be available. To address these issues, we present the Unbiased Organism-agnostic Signal Peptide Network (USPNet), a model for signal peptide prediction and cleavage site prediction based on a deep protein language model. We propose using label distribution-aware margin (LDAM) loss and evolutionary scale modeling (ESM) embeddings to handle the data imbalance and object-dependence problems. Extensive experimental results demonstrate that the proposed method significantly outperforms all previous methods in classification performance. An additional study on simulated metagenomic data further indicates that our model is a more universal and robust tool that does not depend on additional group information about proteins, with the Matthews correlation coefficient improved by up to 17.5%. The proposed method is potentially useful for discovering new signal peptides in abundant metagenomic data.
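
The abstract names LDAM loss as the remedy for class imbalance but does not spell it out. The following is a minimal NumPy sketch of the general LDAM idea (a per-class margin proportional to n_j^(-1/4), subtracted from the true-class logit before a scaled cross-entropy), not the USPNet implementation. The function names, the class counts, and the `max_margin` and `scale` defaults are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j^(-1/4), rescaled so the rarest class gets max_margin."""
    counts = np.asarray(class_counts, dtype=float)
    margins = 1.0 / counts ** 0.25
    return margins * (max_margin / margins.max())


def ldam_loss(logits, labels, class_counts, max_margin=0.5, scale=30.0):
    """Mean cross-entropy after subtracting the class margin from each true-class logit."""
    labels = np.asarray(labels, dtype=int)
    margins = ldam_margins(class_counts, max_margin)
    adjusted = np.array(logits, dtype=float)  # copy so the caller's array is untouched
    rows = np.arange(len(labels))
    adjusted[rows, labels] -= margins[labels]  # rare classes are pushed harder
    adjusted *= scale
    adjusted -= adjusted.max(axis=1, keepdims=True)  # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())


# Hypothetical two-class setup: 1000 examples of class 0, 10 of class 1.
counts = [1000, 10]
logits = np.array([[2.0, 1.0], [0.5, 1.5]])
labels = [0, 1]
loss_with_margin = ldam_loss(logits, labels, counts)
loss_without_margin = ldam_loss(logits, labels, counts, max_margin=0.0)
```

Setting `max_margin=0.0` reduces the function to an ordinary scaled cross-entropy, which makes the effect of the margin easy to compare on the same inputs.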

Matching journals

The top 3 journals account for 50% of the predicted probability mass.
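
The 50% statement can be checked by accumulating the listed probabilities in rank order. The sketch below uses the first five percentages from the list; the helper name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (percent) of the five top-ranked journals, as listed on this page.
top_probabilities = [29.2, 15.1, 6.7, 5.1, 4.5]


def journals_to_reach(probabilities, threshold=50.0):
    """Return (k, cumulative) for the smallest k whose running total reaches the threshold."""
    for k, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return k, total
    return None  # threshold never reached


k, cumulative = journals_to_reach(top_probabilities)
```

With these values the running totals are 29.2, 44.3, and 51.0, so the threshold is first crossed at the third journal.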

1. Briefings in Bioinformatics | 326 papers in training set | Top 0.1% | 29.2%
2. Bioinformatics | 1061 papers in training set | Top 2% | 15.1%
3. BMC Bioinformatics | 383 papers in training set | Top 1% | 6.7%
50% of probability mass above
4. PLOS Computational Biology | 1633 papers in training set | Top 6% | 5.1%
5. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 1% | 4.5%
6. PLOS ONE | 4510 papers in training set | Top 37% | 3.8%
7. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 2.7%
8. Scientific Reports | 3102 papers in training set | Top 48% | 2.2%
9. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.7% | 2.2%
10. IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.2% | 1.8%
11. Journal of Computational Biology | 37 papers in training set | Top 0.2% | 1.8%
12. Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.2%
13. Bioinformatics Advances | 184 papers in training set | Top 4% | 0.9%
14. Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.8%
15. Expert Systems with Applications | 11 papers in training set | Top 0.3% | 0.8%
16. Neurocomputing | 13 papers in training set | Top 0.5% | 0.8%
17. Quantitative Biology | 11 papers in training set | Top 0.6% | 0.8%
18. Journal of Proteome Research | 215 papers in training set | Top 2% | 0.8%
19. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.8% | 0.7%
20. Nucleic Acids Research | 1128 papers in training set | Top 19% | 0.7%
21. Frontiers in Genetics | 197 papers in training set | Top 12% | 0.5%
22. Journal of Genetics and Genomics | 36 papers in training set | Top 3% | 0.5%