Designing mRNA coding sequence via multimodal reverse translation language modeling with Pro2RNA
Bian, B.; Zhang, Y.; Zhang, J.; Asai, K.; Saito, Y.
mRNA coding sequence design is a critical component in the development of mRNA vaccines, nucleic acid therapeutics, and heterologous gene expression systems. Although large language models have recently been applied successfully to protein design and RNA modeling, designing optimal mRNA coding sequences for a given protein, particularly in a species-specific manner, remains a major challenge. Here, we present Pro2RNA, a multimodal reverse-translation language model that generates mRNA coding sequences from their corresponding protein sequences while explicitly conditioning on host organism taxonomy information. Pro2RNA integrates multiple pretrained language models across different modalities: ESM2 for protein representation, SciBERT for taxonomy understanding, and a generative RNA language model for codon-level mRNA sequence generation. By training on mRNA-protein pairs from separate eukaryotic and bacterial datasets, Pro2RNA learns species-dependent genetic codes and codon usage patterns, enabling the generation of host-adapted, natural-like mRNA coding sequences. Across multiple benchmark evaluations, Pro2RNA matches or surpasses existing optimization methods, demonstrating its potential as a powerful and flexible framework for species-aware mRNA coding sequence design.
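To make the reverse-translation task concrete, the following is a minimal sketch of a frequency-based baseline: it counts codon usage in a set of host coding sequences and then picks the most frequent synonymous codon for each amino acid. This is not the Pro2RNA model, which is a learned language model; it only illustrates what "host-adapted" codon choice means. The sketch assumes the standard genetic code, and the host sequences and function names (`codon_usage`, `reverse_translate`) are hypothetical placeholders chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (an assumption; some species use variant codes).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Synonymous codons per amino acid, used as a fallback for unseen residues.
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def translate(mrna):
    """Translate an in-frame mRNA coding sequence into a protein string."""
    return "".join(CODON_TO_AA[mrna[i:i + 3]] for i in range(0, len(mrna), 3))


def codon_usage(coding_sequences):
    """Count how often each codon encodes each amino acid in the host set."""
    usage = defaultdict(Counter)
    for cds in coding_sequences:
        for i in range(0, len(cds), 3):
            codon = cds[i:i + 3]
            usage[CODON_TO_AA[codon]][codon] += 1
    return usage


def reverse_translate(protein, usage):
    """Pick the most frequent host codon for each residue ('*' = stop)."""
    codons = []
    for aa in protein:
        if usage.get(aa):
            codons.append(usage[aa].most_common(1)[0][0])
        else:
            # Residue never observed in the host set: use any valid codon.
            codons.append(AA_TO_CODONS[aa][0])
    return "".join(codons)


# Hypothetical toy "host" coding sequences, for illustration only.
host_cds = ["AUGGCUGCUAAAUAA", "AUGGCUGCCAAGAAAUAA"]
usage = codon_usage(host_cds)
designed = reverse_translate("MAK*", usage)
print(designed, translate(designed))
```

A per-residue frequency lookup like this ignores sequence context entirely, which is one reason a generative model conditioned on the full protein and host taxonomy could plausibly produce more natural-looking sequences.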
Matching journals
The top 6 journals account for 50% of the predicted probability mass.