CDS-BART: A BART-Based Foundation Model for mRNA Sequence Analysis

Jadamba, E.; Lee, S.-H.; Hong, J.; Lee, H.; Lee, S.; Shin, H.

2026-03-11 bioinformatics
10.64898/2026.03.09.710670 bioRxiv
Summary: Recent advancements in artificial intelligence (AI) have led to the development of foundation models that interpret mRNA as a language. Notable examples include CodonBERT, hydraRNA, EVO2, and Helix-mRNA. These models demonstrate significant potential as powerful tools for mRNA research. However, to the best of our knowledge, there is currently no publicly available AI model that is both easy to use and capable of analyzing mRNA sequences up to about 4 kb, a length scale typical of many therapeutic mRNAs, including those encapsulated within lipid nanoparticles (LNPs). We therefore propose CDS-BART, a user-friendly, open-source tool that integrates SentencePiece sub-word tokenization with the denoising sequence-to-sequence training of Bidirectional and Auto-Regressive Transformers (BART). CDS-BART was pre-trained on mRNA data from nine taxonomic groups provided by the NCBI RefSeq database. This comprehensive pre-training, coupled with BART's denoising capability, ensures effective learning of codon usage, mRNA structure, evolution, and regulation. As a result, CDS-BART can deliver robust performance across a wide range of mRNA prediction tasks.

Availability and Implementation: CDS-BART is released under the MIT License. The latest code is available on GitHub at https://github.com/mogam-ai/CDS-BART.
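
The denoising objective mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the codon-level split is only a stand-in for the SentencePiece sub-word tokenizer, and the names `split_codons`, `infill_noise`, `mask_ratio`, and `max_span` are hypothetical. It shows the general BART-style idea of replacing contiguous token spans with a single mask token, so that the corrupted sequence feeds the encoder and the original sequence is the reconstruction target.

```python
import random

MASK = "<mask>"  # placeholder mask symbol; the real vocabulary may differ


def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into 3-nt codons.

    Assumption: a simple stand-in for SentencePiece sub-word pieces.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def infill_noise(tokens, mask_ratio=0.3, max_span=3, seed=0):
    """Replace random contiguous spans with one MASK token each.

    Simplified span corruption: span lengths are drawn uniformly from
    1..max_span, and at most about mask_ratio of the tokens are hidden.
    """
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))  # tokens we may still hide
    noisy = []
    i = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(rng.randint(1, max_span), budget, len(tokens) - i)
            noisy.append(MASK)  # the whole span collapses to one mask
            i += span
            budget -= span
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy


cds = "ATGGCCAAGCTGACCGGTGAATTCTAA"  # toy sequence, not real data
target = split_codons(cds)           # what the decoder should reconstruct
source = infill_noise(target)        # corrupted input for the encoder
print(source)
print(target)
```

Because each masked span collapses to one token, the model has to infer how many tokens are missing as well as which ones, which is what distinguishes span infilling from per-token masking.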

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 0.8% | 28.2%
2 | Bioinformatics Advances | 184 papers in training set | Top 0.1% | 18.9%
3 | Nucleic Acids Research | 1128 papers in training set | Top 2% | 8.4%
50% of probability mass above
4 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.4% | 6.9%
5 | PLOS Computational Biology | 1633 papers in training set | Top 7% | 4.9%
6 | BMC Bioinformatics | 383 papers in training set | Top 2% | 4.2%
7 | Briefings in Bioinformatics | 326 papers in training set | Top 2% | 3.7%
8 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.4%
9 | Nature Communications | 4913 papers in training set | Top 46% | 2.1%
10 | iScience | 1063 papers in training set | Top 17% | 1.5%
11 | Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.8%
12 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 42% | 0.8%
13 | Computers in Biology and Medicine | 120 papers in training set | Top 4% | 0.8%
14 | ACS Synthetic Biology | 256 papers in training set | Top 3% | 0.8%
15 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 5% | 0.7%
16 | Communications Biology | 886 papers in training set | Top 25% | 0.7%
17 | GigaScience | 172 papers in training set | Top 4% | 0.7%
18 | Frontiers in Immunology | 586 papers in training set | Top 9% | 0.7%
19 | Genome Research | 409 papers in training set | Top 5% | 0.7%
20 | Scientific Reports | 3102 papers in training set | Top 78% | 0.7%
21 | ImmunoInformatics | 11 papers in training set | Top 0.2% | 0.7%
22 | Patterns | 70 papers in training set | Top 3% | 0.7%
23 | PLOS ONE | 4510 papers in training set | Top 71% | 0.7%
24 | npj Systems Biology and Applications | 99 papers in training set | Top 3% | 0.7%
25 | Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 1% | 0.5%
26 | Advanced Science | 249 papers in training set | Top 23% | 0.5%
27 | Journal of Molecular Biology | 217 papers in training set | Top 5% | 0.5%
28 | Journal of Chemical Information and Modeling | 207 papers in training set | Top 4% | 0.5%
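
The 50% cutoff marked in the list above can be checked with a short cumulative sum. The sketch below assumes the cutoff is the smallest number of top-ranked journals whose predicted probabilities add up to at least 50%; the page does not state its exact rule, and the function name `journals_to_reach` is hypothetical. Only the first five probabilities from the list are used.

```python
# Predicted probabilities (percent) for the five top-ranked journals,
# copied from the list above.
probs = [28.2, 18.9, 8.4, 6.9, 4.9]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative mass) once the running sum reaches threshold."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    raise ValueError("threshold not reached with the given probabilities")


count, total = journals_to_reach(probs)
print(count, round(total, 1))
```

Under this assumption the first two journals sum to 47.1%, which falls short of the threshold, and adding the third brings the total to 55.5%, which matches the position of the cutoff line in the list.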