A trainable language model for modulating translation rates in non-model organisms by generating upstream untranslated region sequence libraries

Duggan, A. D.; Newman, M. P.; McMillen, D. R.

2026-04-20 synthetic biology
10.64898/2026.04.18.719341 bioRxiv

Tuning protein expression in non-model organisms is often constrained by the lack of validated genetic parts and predictive design tools. Translational tuning through the modulation of upstream untranslated regions (5'-UTRs) offers a potentially organism-agnostic route, but existing methods typically rely on mechanistic assumptions, prior knowledge that may not be available in non-model contexts, or the screening of sequence libraries. Here, we present a simple generative approach for creating synthetic 5'-UTR libraries based solely on the genomic sequence statistics of any desired organism. The method uses a sliding-window n-gram language model applied to native 5'-UTR sequences to produce novel sequences that preserve organism-specific base distributions and motifs without hard-coding specific motifs or mechanistic rules into inflexible statistical templates. We have applied this approach to the model bacterium Escherichia coli and the non-model probiotic Limosilactobacillus reuteri. Libraries of approximately 1,000 sequences were generated for each organism, from which about 100 unique sequences were experimentally tested for translation of a fluorescent reporter protein. In both organisms, the synthetic libraries yielded a broad range of translation levels from this relatively small number of tested variants. Sequences derived from an organism's own genomic statistics generally performed better in that organism than sequences derived from the other species. Correlations of individual sequence performance across the two species were weak, and thermodynamic predictions of ribosome binding strength showed very little predictive power, especially in the non-model L. reuteri. The results demonstrate that simple statistical language model approaches applied to genomic data can generate functional translational regulatory sequence libraries without detailed mechanistic knowledge or explicit reference to consensus motifs.
The approach requires minimal computational resources, avoids reproducing native sequences, and can be readily applied to any organism with a sequenced genome. This strategy may lower technical barriers to expression tuning in non-model organisms.
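
To make the idea of a sliding-window n-gram generator concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the window size, how sequences are started, or how their length is set, so the window of 4 bases, the choice to seed each sequence with the opening bases of a native 5'-UTR, and the fixed output length are all assumptions. The names train_ngram, generate and make_library are hypothetical, and the sequences in NATIVE_UTRS are invented placeholders rather than real genomic data.

```python
import random
from collections import Counter, defaultdict

# Invented placeholder sequences standing in for native 5'-UTRs (assumption).
NATIVE_UTRS = [
    "ATTAAAGAGGAGAAATTAACTATGCAAGGA",
    "TTTAACTTTAAGAAGGAGATATACATACCA",
    "GCTAAGGAGGTAAATAATTCAAGGAGAAAT",
    "ATTCAAGGAGGTTTAACTAAGAAATACATA",
    "TTTAAGAAATTAAGGAGATAAACTATACCA",
]


def train_ngram(seqs, n=4):
    """Slide a window of n bases over each sequence and count, for every
    (n-1)-base context, how often each next base follows it."""
    k = n - 1
    counts = defaultdict(Counter)
    starts = []  # opening contexts, used to seed generation (assumption)
    for s in seqs:
        s = s.upper()
        if len(s) < n:
            continue
        starts.append(s[:k])
        for i in range(len(s) - k):
            counts[s[i:i + k]][s[i + k]] += 1
    return counts, starts


def generate(counts, starts, length, rng):
    """Sample one sequence base by base from the context counts."""
    seq = rng.choice(starts)
    k = len(seq)
    while len(seq) < length:
        options = counts.get(seq[-k:])
        if not options:
            break  # context was only seen at the end of a training sequence
        bases, weights = zip(*sorted(options.items()))
        seq += rng.choices(bases, weights=weights)[0]
    return seq


def make_library(seqs, size, length, n=4, seed=0, max_tries=10000):
    """Collect up to `size` unique full-length sequences, skipping any that
    exactly reproduce a training sequence."""
    rng = random.Random(seed)
    counts, starts = train_ngram(seqs, n)
    native = {s.upper() for s in seqs}
    library = set()
    for _ in range(max_tries):
        if len(library) >= size:
            break
        cand = generate(counts, starts, length, rng)
        if len(cand) == length and cand not in native:
            library.add(cand)
    return sorted(library)


library = make_library(NATIVE_UTRS, size=5, length=20)
for seq in library:
    print(seq)
```

Because each base is drawn only from continuations observed in the training set, every 4-base window of a generated sequence also occurs somewhere in the native sequences, which is one simple way local motifs can be carried over without being specified explicitly. A larger window would copy longer native stretches, and a smaller one would approach plain base-composition sampling.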

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Synthetic Biology (21 papers in training set): Top 0.1%, 38.3%
2. ACS Synthetic Biology (256 papers in training set): Top 0.2%, 22.8%
50% of probability mass above
3. Nucleic Acids Research (1128 papers in training set): Top 5%, 4.0%
4. PLOS Computational Biology (1633 papers in training set): Top 8%, 4.0%
5. PLOS ONE (4510 papers in training set): Top 35%, 4.0%
6. Bioinformatics (1061 papers in training set): Top 5%, 3.6%
7. BMC Bioinformatics (383 papers in training set): Top 3%, 3.1%
8. Computational and Structural Biotechnology Journal (216 papers in training set): Top 4%, 1.7%
9. mSystems (361 papers in training set): Top 5%, 1.3%
10. Frontiers in Bioengineering and Biotechnology (88 papers in training set): Top 2%, 1.1%
11. Journal of The Royal Society Interface (189 papers in training set): Top 3%, 1.1%
12. Philosophical Transactions of the Royal Society B (51 papers in training set): Top 5%, 0.9%
13. Nature Communications (4913 papers in training set): Top 64%, 0.7%
14. Briefings in Bioinformatics (326 papers in training set): Top 7%, 0.7%
15. Frontiers in Microbiology (375 papers in training set): Top 10%, 0.7%
16. Bioinformatics Advances (184 papers in training set): Top 5%, 0.7%
17. NAR Genomics and Bioinformatics (214 papers in training set): Top 4%, 0.7%
18. Bioengineering (24 papers in training set): Top 2%, 0.5%
19. PeerJ (261 papers in training set): Top 19%, 0.5%
20. Nature Biotechnology (147 papers in training set): Top 9%, 0.5%
21. Metabolic Engineering (68 papers in training set): Top 0.9%, 0.5%
22. Genome Biology (555 papers in training set): Top 9%, 0.5%