Decoding the Sequence Requirements for Translation Initiation

Verhagen, B. M.; Liedtke, D.; Barbadilla-Martinez, L.; Alverado, C.; Petrychenko, V.; Swirski, M.; Muller, M.; Valen, E.; Puglisi, J.; de Ridder, J.; Fischer, N.; Tanenbaum, M. E.

2026-05-12 molecular biology
10.64898/2026.05.12.723742 bioRxiv

Accurate selection of start codons by ribosomes is a fundamental determinant of proteome composition. Although the Kozak sequence, an 8-nucleotide sequence flanking the start codon, has long been viewed as the primary determinant of initiation in eukaryotes, it fails to explain the large diversity of start codon usage across transcripts. Here we combine massively parallel reporter assays, bioinformatics, machine learning, single-molecule imaging and cryo-electron microscopy to define the extended translation initiation sequence (eTIS), an ~80-nucleotide sequence surrounding the start codon that governs initiation efficiency. A deep-learning model trained on eTIS features accurately predicts translation initiation across transcripts. Unexpectedly, we find that the Kozak sequence is not optimal for initiation as is widely presumed, and we identify the origin of this discrepancy. eTIS nucleotides that promote efficient initiation are enriched in the human transcriptome and are evolutionarily conserved, underscoring their functional importance. Biophysical and structural analyses reveal that specific eTIS residues (including the key +6 position and residues in the mRNA entry and exit channels) engage ribosomal proteins, rRNA and initiation factors to promote start codon recognition by stabilizing the ribosome at the start codon and facilitating the structural transitions required for initiation. Finally, optimization of the eTIS markedly enhances translational fidelity and protein output from therapeutic mRNAs, highlighting its practical utility. Together, these findings redefine the sequence logic of translation initiation and establish a framework for precise control of protein expression.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|---|---|---|---|---|
| 1 | Science | 429 | Top 0.6% | 18.1% |
| 2 | Nature Communications | 4913 | Top 12% | 14.0% |
| 3 | Nature Structural & Molecular Biology | 218 | Top 0.1% | 14.0% |
| 4 | Proceedings of the National Academy of Sciences | 2130 | Top 8% | 8.2% |
| 5 | Nature | 575 | Top 4% | 6.6% |
| 6 | Molecular Cell | 308 | Top 4% | 4.2% |
| 7 | Cell | 370 | Top 7% | 3.5% |
| 8 | Science Advances | 1098 | Top 12% | 2.3% |
| 9 | Nature Biotechnology | 147 | Top 4% | 2.3% |
| 10 | Cell Systems | 167 | Top 7% | 1.8% |
| 11 | Cell Reports | 1338 | Top 23% | 1.8% |
| 12 | eLife | 5422 | Top 44% | 1.6% |
| 13 | Nature Cell Biology | 99 | Top 3% | 1.4% |
| 14 | Developmental Cell | 168 | Top 10% | 1.3% |
| 15 | Nature Methods | 336 | Top 5% | 1.3% |
| 16 | Neuron | 282 | Top 8% | 0.9% |
| 17 | Nature Microbiology | 133 | Top 4% | 0.9% |
| 18 | Advanced Science | 249 | Top 17% | 0.9% |
| 19 | Nucleic Acids Research | 1128 | Top 17% | 0.8% |
| 20 | Nature Machine Intelligence | 61 | Top 3% | 0.8% |
| 21 | Nature Chemical Biology | 104 | Top 4% | 0.8% |
| 22 | Nature Genetics | 240 | Top 8% | 0.7% |
| 23 | The Lancet Infectious Diseases | 71 | Top 3% | 0.7% |
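
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is only an illustration: the list holds the first eight probabilities copied from the ranking, and the helper name `journals_to_reach` is hypothetical, not part of the site's method.

```python
# Predicted probabilities (percent) for ranks 1-8, copied from the ranking above.
probabilities = [18.1, 14.0, 14.0, 8.2, 6.6, 4.2, 3.5, 2.3]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative) for the shortest prefix whose sum reaches threshold.

    Hypothetical helper for illustration; returns (None, total) if the
    threshold is never reached.
    """
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= threshold:
            return count, cumulative
    return None, cumulative


count, cumulative = journals_to_reach(probabilities, 50.0)
print(count, round(cumulative, 1))
```

With these values the running sum is 32.1% after two journals, 46.1% after three, and 54.3% after four, so rank 4 is the first point at which the 50% threshold is crossed.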