
Large mRNA language foundation modeling with NUWA for unified sequence perception and generation

Zhong, Y.; Yan, W.; Zhang, Y.; Tan, K.; Bian, B.

2025-11-02 · bioinformatics
bioRxiv · doi:10.1101/2025.11.01.686058
Abstract

mRNA serves as a crucial bridge between DNA and proteins. Compared to DNA, mRNA sequences are far more concise and information-dense, making mRNA an ideal language through which to explore biological principles. In this study, we present NUWA, a large mRNA language foundation model built on a BERT-like architecture and trained with curriculum masked language modeling and a supervised contrastive loss for unified mRNA sequence perception and generation. For pretraining, we used large-scale mRNA coding sequences comprising approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, and pre-trained a domain-specific model for each of the three domains. This enables NUWA to learn coding-sequence patterns across the entire tree of life. The fine-tuned NUWA performs strongly on a variety of downstream tasks, excelling not only in RNA-related perception tasks but also in cross-modal protein-related tasks. On the generation front, NUWA pioneers an entropy-guided strategy that enables BERT-like models to generate mRNA sequences, producing natural-like sequences that accurately recapitulate species-specific codon usage patterns. Moreover, NUWA can be effectively fine-tuned on small, task-specific datasets to generate functional mRNAs with desired properties, including sequences that do not exist in nature, and to design coding sequences for diverse proteins in biomanufacturing, vaccine development, and therapeutic applications. To our knowledge, NUWA is the first mRNA language model for unified sequence perception and generation, providing a versatile and programmable platform for mRNA design.
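
The abstract attributes NUWA's generation ability to an "entropy-guided strategy" for BERT-like models but gives no algorithmic details. The sketch below is one plausible reading, assuming a Mask-Predict-style iterative unmasking loop in PyTorch in which predictive entropy decides which positions to commit first; the `model` interface, `mask_id`, and the unmasking schedule are illustrative assumptions, not NUWA's actual implementation.

    # Hypothetical sketch of entropy-guided generation with a BERT-style
    # masked language model: start fully masked, then repeatedly commit
    # the positions where the model's predictive entropy is lowest.
    # `model` is assumed to map token ids (1, L) to logits (1, L, vocab).
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def entropy_guided_generate(model, mask_id: int, length: int, steps: int = 10):
        ids = torch.full((1, length), mask_id, dtype=torch.long)
        per_step = max(1, length // steps)  # positions to reveal per iteration
        while (ids == mask_id).any():
            logits = model(ids)                           # (1, L, vocab) -- assumed
            probs = F.softmax(logits, dim=-1)
            ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (1, L)
            ent[ids != mask_id] = float("inf")            # rank only masked slots
            k = min(per_step, int((ids == mask_id).sum()))
            pick = ent[0].topk(k, largest=False).indices  # lowest-entropy positions
            ids[0, pick] = probs[0, pick].argmax(-1)      # commit confident tokens
        return ids

Committing low-entropy positions first lets later iterations condition on the tokens the model is most certain about, which is one way a bidirectional encoder can produce coherent sequences despite lacking an autoregressive decoder.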

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Nature Machine Intelligence | 61 | Top 0.1% | 22.6%
2 | Nature Communications | 4913 | Top 10% | 14.4%
3 | Advanced Science | 249 | Top 2% | 8.5%
4 | Nucleic Acids Research | 1128 | Top 2% | 8.4%
--- 50% of probability mass above this line ---
5 | Cell Systems | 167 | Top 3% | 4.3%
6 | Briefings in Bioinformatics | 326 | Top 2% | 3.6%
7 | Nature Biotechnology | 147 | Top 4% | 2.1%
8 | Cell Genomics | 162 | Top 2% | 2.1%
9 | Genome Biology | 555 | Top 3% | 2.1%
10 | Nature Methods | 336 | Top 4% | 1.7%
11 | Proceedings of the National Academy of Sciences | 2130 | Top 35% | 1.5%
12 | Bioinformatics | 1061 | Top 8% | 1.5%
13 | Cell Research | 49 | Top 1% | 1.5%
14 | Computational and Structural Biotechnology Journal | 216 | Top 6% | 1.2%
15 | PLOS Computational Biology | 1633 | Top 20% | 1.2%
16 | Genomics, Proteomics & Bioinformatics | 171 | Top 4% | 1.2%
17 | Science | 429 | Top 17% | 1.0%
18 | Genome Medicine | 154 | Top 6% | 1.0%
19 | National Science Review | 22 | Top 2% | 0.9%
20 | Communications Biology | 886 | Top 18% | 0.9%
21 | Nature | 575 | Top 16% | 0.7%
22 | Cell Discovery | 54 | Top 5% | 0.7%
23 | PLOS ONE | 4510 | Top 69% | 0.7%
24 | Scientific Reports | 3102 | Top 76% | 0.7%
25 | Cell Reports Medicine | 140 | Top 9% | 0.6%
26 | iScience | 1063 | Top 37% | 0.6%
27 | Protein & Cell | 25 | Top 3% | 0.6%
28 | Genome Research | 409 | Top 5% | 0.5%
29 | Nature Computational Science | 50 | Top 2% | 0.5%
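
The "50% of probability mass" cutoff noted above is simply a cumulative sum over the ranked probabilities: rank 4 is the first rank at which the running total exceeds 50% (the exact cumulative figure there is 53.9%). An illustrative check against values transcribed from the table:

    # Cumulative probability mass over the ranked journal predictions
    # (top-10 values transcribed from the table above, in percent).
    probs = [22.6, 14.4, 8.5, 8.4, 4.3, 3.6, 2.1, 2.1, 2.1, 1.7]

    cum = 0.0
    for rank, p in enumerate(probs, start=1):
        cum += p
        if cum >= 50.0:
            print(f"top {rank} journals cover {cum:.1f}% of the mass")
            break
    # prints: top 4 journals cover 53.9% of the mass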