Protein language model-based end-to-end type II polyketide prediction without sequence alignment

Qin, Z.; Zhang, H.; Huang, J.; Gao, Q.; Tang, Y.; Wu, Y.

2023-04-20 evolutionary biology

10.1101/2023.04.18.537339 bioRxiv

Natural products are important sources for drug development, and the precise prediction of their structures assembled by modular proteins is an area of great interest. In this study, we introduce DeepT2, an end-to-end, cost-effective, and accurate machine learning platform to accelerate the identification of type II polyketides (T2PKs), which represent a significant portion of the natural product world. Our algorithm is based on advanced natural language processing models and utilizes the core biosynthetic enzyme, chain length factor (CLF or KSβ), as computing inputs. The process involves sequence embedding, data labeling, classifier development, and novelty detection, which enable precise classification and prediction directly from KSβ without sequence alignments. Combined with metagenomics and metabolomics, we evaluated the ability of DeepT2 and found this model could easily detect and classify KSβ either as a single sequence or a mixture of bacterial genomes, and subsequently identify the corresponding T2PKs in a labeled categorized class or as novel. Our work highlights deep learning as a promising framework for genome mining and therefore provides a meaningful platform for discovering medically important natural products.
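
The four stages named in the abstract (sequence embedding, data labeling, classifier development, novelty detection) can be illustrated with a toy sketch. This is not the DeepT2 implementation: the amino-acid frequency vector stands in for a protein language model embedding, the nearest-centroid rule stands in for the trained classifier, and the sequences, the labels `class_A` and `class_B`, and the threshold value are all hypothetical placeholders chosen only to make the example run.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Toy embedding (assumption): amino-acid frequency vector, no alignment needed."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fit_centroids(labeled):
    """Average the embeddings of each labeled class into one centroid per label."""
    groups = {}
    for seq, label in labeled:
        groups.setdefault(label, []).append(embed(seq))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

def predict(seq, centroids, novelty_threshold):
    """Return (label, distance); label is 'novel' if no centroid is close enough."""
    vec = embed(seq)
    label, dist = min(
        ((lab, distance(vec, c)) for lab, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (label if dist <= novelty_threshold else "novel"), dist

# Hypothetical labeled sequences; real inputs would be KSβ protein sequences.
training = [
    ("AAAAGGGGAAAG", "class_A"),
    ("AAAGGGGAAAAG", "class_A"),
    ("LLLLKKKKLLLK", "class_B"),
    ("LLLKKKKLLLLK", "class_B"),
]
centroids = fit_centroids(training)

known_label, _ = predict("AAAAGGGGAAGG", centroids, novelty_threshold=0.3)
novel_label, _ = predict("WWWWYYYYWWWW", centroids, novelty_threshold=0.3)
print(known_label, novel_label)
```

The distance threshold is the only piece that separates "assign to a known class" from "flag as novel", so in any real setting it would need to be calibrated on held-out data rather than fixed by hand as it is here.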

Matching journals

●Non-profit ◐University press ○Commercial

The top 12 journals account for 50% of the predicted probability mass.

Quantitative Biology

○ 11 papers in training set

Briefings in Bioinformatics

◐ 326 papers in training set

Communications Chemistry

○ 39 papers in training set

● 4510 papers in training set

PLOS Computational Biology

● 1633 papers in training set

Genomics, Proteomics & Bioinformatics

◐ 171 papers in training set

Advanced Science

○ 249 papers in training set

Science China Life Sciences

○ 26 papers in training set

Scientific Reports

○ 3102 papers in training set

Journal of Chemical Information and Modeling

● 207 papers in training set

Science Bulletin

○ 22 papers in training set

Computers in Biology and Medicine

○ 120 papers in training set

50% of probability mass above

Computational and Structural Biotechnology Journal

● 216 papers in training set

○ 70 papers in training set

National Science Review

◐ 22 papers in training set

◐ 1061 papers in training set

● 5422 papers in training set

ACS Pharmacology & Translational Science

● 40 papers in training set

Frontiers in Microbiology

○ 375 papers in training set

○ 1063 papers in training set

Plant Communications

○ 35 papers in training set

Synthetic and Systems Biotechnology

○ 10 papers in training set

Journal of Genetics and Genomics

○ 36 papers in training set

International Journal of Biological Macromolecules

○ 65 papers in training set

Frontiers in Cell and Developmental Biology

○ 218 papers in training set

○ 126 papers in training set

Communications Biology

○ 886 papers in training set

Acta Pharmaceutica Sinica B

○ 11 papers in training set

○ 23 papers in training set

ACS Synthetic Biology

● 256 papers in training set