A trainable language model for modulating translation rates in non-model organisms by generating upstream untranslated region sequence libraries

Duggan, A. D.; Newman, M. P.; McMillen, D. R.

2026-04-20 synthetic biology

10.64898/2026.04.18.719341 bioRxiv


Tuning protein expression in non-model organisms is often constrained by the lack of validated genetic parts and predictive design tools. Translational tuning through the modulation of upstream untranslated regions (5'-UTRs) offers a potentially organism-agnostic route, but existing methods typically rely on mechanistic assumptions, prior knowledge that may not be available in non-model contexts, or the screening of sequence libraries. Here, we present a simple generative approach for creating synthetic 5'-UTR libraries based solely on the genomic sequence statistics of any desired organism. The method uses a sliding-window n-gram language model applied to native 5'-UTR sequences to produce novel sequences that preserve organism-specific base distributions and motifs without hard-coding specific motifs or mechanistic rules into inflexible statistical templates. We have applied this approach to the model bacterium Escherichia coli and the non-model probiotic Limosilactobacillus reuteri. Libraries of approximately 1,000 sequences were generated for each organism, from which about 100 unique sequences were experimentally tested for translation of a fluorescent reporter protein. In both organisms, the synthetic libraries yielded a broad range of translation levels from this relatively small number of tested variants. Sequences derived from an organism's own genomic statistics generally performed better in that organism than sequences derived from the other species. Correlations of individual sequence performance across the two species were weak, and thermodynamic predictions of ribosome binding strength showed very little predictive power, especially in the non-model L. reuteri. The results demonstrate that simple statistical language model approaches applied to genomic data can generate functional translational regulatory sequence libraries without detailed mechanistic knowledge or explicit reference to consensus motifs.
The approach requires minimal computational resources, avoids reproducing native sequences, and can be readily applied to any organism with a sequenced genome. This strategy may lower technical barriers to expression tuning in non-model organisms.
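
To illustrate the general idea of an n-gram generator of this kind, the following is a minimal sketch and not the authors' implementation. The abstract does not state the n-gram order, the window handling, the sequence length, or any smoothing, so the values used here (`n=3`, `max_len=30`, start/end padding symbols `^` and `$`) and all function names are assumptions made for illustration. The toy input sequences are invented and are not real 5'-UTRs.

```python
import random
from collections import Counter, defaultdict


def train_ngram(seqs, n=3):
    """Slide a window of width n over each sequence and count which
    symbol follows each (n-1)-base context. '^' pads the start and
    '$' marks the end (both are illustrative choices)."""
    counts = defaultdict(Counter)
    for s in seqs:
        padded = "^" * (n - 1) + s + "$"
        for i in range(len(padded) - n + 1):
            context = padded[i:i + n - 1]
            following = padded[i + n - 1]
            counts[context][following] += 1
    return counts


def sample_sequence(counts, n=3, max_len=30, rng=random):
    """Generate one sequence by repeatedly sampling the next base in
    proportion to its observed count after the current context."""
    context = "^" * (n - 1)
    out = []
    while len(out) < max_len:
        symbols, weights = zip(*counts[context].items())
        nxt = rng.choices(symbols, weights=weights)[0]
        if nxt == "$":  # sampled end-of-sequence marker
            break
        out.append(nxt)
        context = (context + nxt)[1:]
    return "".join(out)


def generate_library(native, size, n=3, max_len=30, seed=0, max_tries=100_000):
    """Collect up to `size` unique sequences, skipping any that
    reproduce a native input sequence exactly."""
    rng = random.Random(seed)
    counts = train_ngram(native, n)
    native_set = set(native)
    library = set()
    tries = 0
    while len(library) < size and tries < max_tries:
        tries += 1
        candidate = sample_sequence(counts, n, max_len, rng)
        if candidate and candidate not in native_set:
            library.add(candidate)
    return sorted(library)


# Invented toy inputs (NOT real genomic 5'-UTRs), for demonstration only.
toy_native = [
    "ATTAAAGAGGAGAAATTAAC",
    "TTTAAGAAGGAGATATACAT",
    "AACAAAGGAGGTTTTACAAT",
    "TATTAAGGAGGAAACAGCTA",
]
library = generate_library(toy_native, size=5)
for seq in library:
    print(seq)
```

In this sketch, a larger `n` makes generated sequences track the native ones more closely, while a smaller `n` preserves mainly base composition; the exact-match filter is only one simple way to avoid reproducing native sequences.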

