RNALens: Study on 5' UTR Modeling and Cell-Specificity

Mao, L.; Tian, Y.; Qian, K.-w.; Song, Y.

2025-07-20 bioinformatics
10.1101/2025.07.20.665722 bioRxiv

Recently, the Transformer architecture has been applied to predict the structure, function, and regulatory activity of biological sequences. Predicting the cell-specific regulatory impact of 5' untranslated regions (5' UTRs) on mRNA expression and translation remains a key challenge for rational mRNA design. Existing studies such as UTR-LM, RNABERT, and RNA-FM train Transformer-based models solely on 5' UTR sequences with fixed nucleotide tokenization schemes and auxiliary structural features. These models pay limited attention to integrating broader genomic context and thermodynamic objectives, which limits their ability to generalize across diverse cell types and to accurately predict both mRNA expression level (EL) and translation efficiency (TE). In this paper, we propose RNALens, a foundation model pre-trained in two stages, first on multispecies genomic sequences and then on curated 5' UTR data, using masked language modeling augmented with secondary structure prediction and minimum free energy regression. RNALens employs byte-pair encoding to capture variable-length nucleotide motifs. It is then fine-tuned on high-throughput reporter assay datasets from HEK293T, PC3, and muscle tissues to yield specialized predictors for EL and TE in each cellular context. Experimental results on benchmark datasets show that RNALens outperforms existing machine learning methods on both expression and translation prediction in cell-specific and cross-context tests, offering an efficient in silico platform for guiding the design of mRNA therapeutics with precise cellular targeting.
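To illustrate what byte-pair encoding over nucleotides means in practice, the following is a minimal sketch, not the RNALens tokenizer. The function names (`learn_bpe`, `merge_pair`), the toy sequences, and the number of merges are all hypothetical; the abstract does not specify the vocabulary size or training corpus details. The sketch starts from single-nucleotide tokens and repeatedly merges the most frequent adjacent pair, so frequent motifs become single variable-length tokens.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges):
    """Learn up to `num_merges` merges from a list of nucleotide strings."""
    corpus = [list(s) for s in seqs]  # begin with one token per nucleotide
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))  # count adjacent token pairs
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


# Toy 5' UTR-like fragments (illustrative only, not from any dataset).
seqs = ["GCCACCAUGG", "GCCGCCACC", "ACCAUGGCC"]
merges, tokenized = learn_bpe(seqs, num_merges=3)
for seq, tokens in zip(seqs, tokenized):
    print(seq, "->", tokens)
```

Concatenating the tokens of each sequence always reproduces the original string, so the tokenization is lossless; only the segmentation changes as merges are learned.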

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics: 326 papers in training set, Top 0.1%, 26.8%
2. Bioinformatics: 1061 papers in training set, Top 2%, 12.9%
3. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 0.5%, 6.6%
4. BMC Bioinformatics: 383 papers in training set, Top 2%, 5.0%

50% of probability mass above

5. Bioinformatics Advances: 184 papers in training set, Top 1%, 3.7%
6. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 2%, 3.2%
7. Nucleic Acids Research: 1128 papers in training set, Top 7%, 2.7%
8. Nature Machine Intelligence: 61 papers in training set, Top 1%, 2.5%
9. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.1%, 2.1%
10. PLOS Computational Biology: 1633 papers in training set, Top 14%, 2.0%
11. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 0.9%, 1.8%
12. PLOS ONE: 4510 papers in training set, Top 52%, 1.8%
13. Frontiers in Genetics: 197 papers in training set, Top 4%, 1.8%
14. Computers in Biology and Medicine: 120 papers in training set, Top 2%, 1.5%
15. Journal of Chemical Information and Modeling: 207 papers in training set, Top 2%, 1.5%
16. Molecular Therapy Nucleic Acids: 32 papers in training set, Top 0.5%, 1.3%
17. Quantitative Biology: 11 papers in training set, Top 0.4%, 1.1%
18. Scientific Reports: 3102 papers in training set, Top 70%, 0.9%
19. Advanced Science: 249 papers in training set, Top 16%, 0.9%
20. Computational Biology and Chemistry: 23 papers in training set, Top 0.4%, 0.8%
21. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set, Top 0.6%, 0.8%
22. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.8%
23. International Journal of Molecular Sciences: 453 papers in training set, Top 17%, 0.7%
24. Frontiers in Molecular Biosciences: 100 papers in training set, Top 7%, 0.5%
25. iScience: 1063 papers in training set, Top 39%, 0.5%
26. National Science Review: 22 papers in training set, Top 3%, 0.5%
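
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. This is a sketch under two assumptions: the final percentage on each entry is that journal's predicted probability, and the cutoff is the first rank at which the running total reaches the threshold (whether the page uses "reaches" or "exceeds" is not stated). The helper name `top_k_for_mass` is hypothetical.

```python
def top_k_for_mass(probs, threshold):
    """Return (k, cumulative mass) for the smallest k whose running sum reaches threshold."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return k, total
    return None, total  # threshold never reached


# Predicted probabilities (in percent) for the six highest-ranked journals listed above.
probs = [26.8, 12.9, 6.6, 5.0, 3.7, 3.2]
k, mass = top_k_for_mass(probs, threshold=50.0)
print(f"top {k} journals cover {mass:.1f}% of the probability mass")
```

The first three entries sum to 46.3%, which is below the threshold, so the fourth entry is the one that crosses it.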