Protein language model-based end-to-end type II polyketide prediction without sequence alignment

Qin, Z.; Zhang, H.; Huang, J.; Gao, Q.; Tang, Y.; Wu, Y.

2023-04-20 evolutionary biology
10.1101/2023.04.18.537339 bioRxiv
Natural products are important sources for drug development, and the precise prediction of their structures assembled by modular proteins is an area of great interest. In this study, we introduce DeepT2, an end-to-end, cost-effective, and accurate machine learning platform to accelerate the identification of type II polyketides (T2PKs), which represent a significant portion of the natural product world. Our algorithm is based on advanced natural language processing models and utilizes the core biosynthetic enzyme, chain length factor (CLF or KSβ), as computing inputs. The process involves sequence embedding, data labeling, classifier development, and novelty detection, which enable precise classification and prediction directly from KSβ without sequence alignments. Combined with metagenomics and metabolomics, we evaluated the ability of DeepT2 and found this model could easily detect and classify KSβ either as a single sequence or a mixture of bacterial genomes, and subsequently identify the corresponding T2PKs in a labeled categorized class or as novel. Our work highlights deep learning as a promising framework for genome mining and therefore provides a meaningful platform for discovering medically important natural products.

Matching journals

The top 12 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Quantitative Biology | 11 | Top 0.1% | 7.2% |
| 2 | Briefings in Bioinformatics | 326 | Top 0.7% | 6.8% |
| 3 | Communications Chemistry | 39 | Top 0.1% | 6.4% |
| 4 | PLOS ONE | 4510 | Top 32% | 4.9% |
| 5 | PLOS Computational Biology | 1633 | Top 8% | 4.2% |
| 6 | Genomics, Proteomics & Bioinformatics | 171 | Top 2% | 4.0% |
| 7 | Advanced Science | 249 | Top 5% | 3.6% |
| 8 | Science China Life Sciences | 26 | Top 0.4% | 3.6% |
| 9 | Scientific Reports | 3102 | Top 40% | 3.3% |
| 10 | Journal of Chemical Information and Modeling | 207 | Top 1% | 2.7% |
| 11 | Science Bulletin | 22 | Top 0.2% | 2.6% |
| 12 | Computers in Biology and Medicine | 120 | Top 1% | 2.6% |

50% of probability mass above

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 13 | Computational and Structural Biotechnology Journal | 216 | Top 3% | 2.1% |
| 14 | Patterns | 70 | Top 0.5% | 2.1% |
| 15 | National Science Review | 22 | Top 0.7% | 1.9% |
| 16 | Bioinformatics | 1061 | Top 7% | 1.9% |
| 17 | eLife | 5422 | Top 40% | 1.8% |
| 18 | ACS Pharmacology & Translational Science | 40 | Top 0.4% | 1.7% |
| 19 | Frontiers in Microbiology | 375 | Top 6% | 1.5% |
| 20 | iScience | 1063 | Top 18% | 1.5% |
| 21 | Plant Communications | 35 | Top 1.0% | 1.3% |
| 22 | Synthetic and Systems Biotechnology | 10 | Top 0.3% | 1.3% |
| 23 | Journal of Genetics and Genomics | 36 | Top 1% | 1.3% |
| 24 | International Journal of Biological Macromolecules | 65 | Top 2% | 1.2% |
| 25 | Frontiers in Cell and Developmental Biology | 218 | Top 6% | 1.2% |
| 26 | Genes | 126 | Top 2% | 1.1% |
| 27 | Communications Biology | 886 | Top 17% | 1.0% |
| 28 | Acta Pharmaceutica Sinica B | 11 | Top 0.7% | 0.9% |
| 29 | Biochimie | 23 | Top 0.4% | 0.7% |
| 30 | ACS Synthetic Biology | 256 | Top 3% | 0.7% |
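
The 50% cutoff stated above can be checked from the listed probabilities with a running total. The following is a minimal sketch, not the site's own code: the function name `journals_to_reach` is hypothetical, and the values are the rounded percentages from the list, so the totals are approximate.

```python
# Predicted probabilities (percent) for ranks 1-30, copied from the list above.
# These are rounded display values, so running totals are approximate.
probabilities = [
    7.2, 6.8, 6.4, 4.9, 4.2, 4.0, 3.6, 3.6, 3.3, 2.7,
    2.6, 2.6, 2.1, 2.1, 1.9, 1.9, 1.8, 1.7, 1.5, 1.5,
    1.3, 1.3, 1.3, 1.2, 1.2, 1.1, 1.0, 0.9, 0.7, 0.7,
]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count
    return None


cutoff = journals_to_reach(probabilities, 50.0)
print(cutoff)  # prints 12
```

With the rounded values, the running total after 11 journals is about 49.3% and after 12 journals about 51.9%, which agrees with the cutoff shown in the list.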