CDS-BART: A BART-Based Foundation Model for mRNA Sequence Analysis
Jadamba, E.; Lee, S.-H.; Hong, J.; Lee, H.; Lee, S.; Shin, H.
Summary: Recent advancements in artificial intelligence (AI) have led to the development of foundation models that interpret mRNA as a language. Notable examples include CodonBERT, hydraRNA, EVO2, and Helix-mRNA. These models demonstrate significant potential as powerful tools for mRNA research. However, to the best of our knowledge, there is currently no publicly available AI model that is both easy to use and capable of analyzing mRNA sequences up to approximately 4 kb, a length scale typical of many therapeutic mRNAs, including those encapsulated within lipid nanoparticles (LNPs). We therefore propose CDS-BART, a user-friendly, open-source tool that integrates SentencePiece sub-word tokenization with the denoising sequence-to-sequence training of Bidirectional and Auto-Regressive Transformers (BART). CDS-BART was pre-trained on mRNA data from nine taxonomic groups provided by the NCBI RefSeq database. This comprehensive pre-training, coupled with BART's denoising capability, ensures effective learning of codon usage, mRNA structure, evolution, and regulation. Thus, CDS-BART can ultimately deliver robust performance across a wide range of mRNA prediction tasks. Availability and Implementation: CDS-BART is released under the MIT License. The latest code is available on GitHub at https://github.com/mogam-ai/CDS-BART.
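To illustrate the training setup described above, the sketch below shows BART-style denoising applied to a coding sequence. Note this is a simplified illustration, not the authors' implementation: CDS-BART uses learned SentencePiece sub-word tokens, whereas here the sequence is split into fixed 3-nt codons for clarity, and the `text_infill` function and mask token name are assumptions.

```python
import random

MASK = "<mask>"

def codon_tokenize(cds):
    # Split a coding sequence into codons (3-nt tokens).
    # CDS-BART itself uses learned SentencePiece sub-words (assumption:
    # codons stand in here as a simple, deterministic tokenizer).
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def text_infill(tokens, span_len=3, rng=None):
    # BART's text-infilling noise: replace a contiguous span of tokens
    # with a single <mask> token; the seq2seq decoder is then trained
    # to reconstruct the original, uncorrupted sequence.
    rng = rng or random.Random(0)
    start = rng.randrange(0, max(1, len(tokens) - span_len))
    return tokens[:start] + [MASK] + tokens[start + span_len:]

# Example: a short CDS (start codon ... stop codon).
cds = "ATGGCCAAGGTGCTGACCTAA"
tokens = codon_tokenize(cds)      # 7 codon tokens
noised = text_infill(tokens)      # training input (corrupted)
# Training target is the original `tokens`; the model learns to
# recover the masked span, which encourages learning codon usage
# and sequence context.
```

During pre-training, pairs of (`noised`, `tokens`) would be fed to the BART encoder and decoder respectively; at a 4 kb scale this objective exposes the model to long-range context across the full coding sequence.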