
Machine Learning Method for Optimizing Coding Sequences in Mammalian Cells

Theodorou, E.; Stadler, M.; Gustafsson, C.; Welch, M.

2026-01-28 bioengineering
10.64898/2026.01.26.701778 bioRxiv

Mammalian cell lines are the preferred hosts for producing commercially relevant therapeutic proteins such as antibodies, multispecifics, and cytokine fusion proteins. Even though significant investment is made to optimize upstream and downstream processes, the optimal gene design parameters for heterologous recombinant protein expression remain poorly understood. We describe here a generic approach to gene optimization in which design features are systematically sampled and modulated iteratively using machine learning (ML). Synthetic genes encoding the Dasher fluorescent protein, differing only in synonymous codons, were used to interrogate the gene-sequence preferences of transient antibody-expressing HEK293 cells. Synonymous codon variations influenced expression by more than two orders of magnitude. This variation in protein yield was used to build ML models relating gene design features to expression, and these models were then employed to design further-improved genes. The ML models were shown to be expression system-specific. Messenger RNA levels and ribosome occupancy were highly correlated with protein levels, suggesting that mRNA lifetime has a causal relationship with coding bias. Our results illustrate a novel, generally applicable method to improve gene expression via synonymous re-coding for any protein target or host cell.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Frontiers in Bioengineering and Biotechnology | 88 papers in training set | Top 0.1% | 18.7%
2 | ACS Synthetic Biology | 256 papers in training set | Top 0.3% | 17.5%
3 | Cell Systems | 167 papers in training set | Top 1% | 9.1%
4 | Scientific Reports | 3102 papers in training set | Top 18% | 6.4%
50% of probability mass above
5 | Biotechnology and Bioengineering | 49 papers in training set | Top 0.1% | 4.8%
6 | Metabolic Engineering | 68 papers in training set | Top 0.2% | 4.0%
7 | Journal of The Royal Society Interface | 189 papers in training set | Top 1.0% | 4.0%
8 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 2.7%
9 | Nucleic Acids Research | 1128 papers in training set | Top 7% | 2.7%
10 | Bioinformatics | 1061 papers in training set | Top 7% | 1.9%
11 | iScience | 1063 papers in training set | Top 15% | 1.7%
12 | PLOS Computational Biology | 1633 papers in training set | Top 16% | 1.7%
13 | Nature Communications | 4913 papers in training set | Top 52% | 1.7%
14 | PLOS ONE | 4510 papers in training set | Top 55% | 1.7%
15 | Cell Reports Methods | 141 papers in training set | Top 2% | 1.7%
16 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 3% | 0.9%
17 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 44% | 0.7%
18 | BMC Bioinformatics | 383 papers in training set | Top 7% | 0.7%
19 | PeerJ | 261 papers in training set | Top 15% | 0.7%
20 | International Journal of Molecular Sciences | 453 papers in training set | Top 17% | 0.7%
21 | Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 1% | 0.6%
22 | Bioengineering | 24 papers in training set | Top 2% | 0.6%
23 | Communications Biology | 886 papers in training set | Top 29% | 0.6%
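
The "50% of probability mass" cutoff can be checked by accumulating the listed probabilities in rank order. The following is a minimal sketch, not the site's own calculation: it assumes the cutoff is simply the first rank at which the running total reaches 50%, and the function name `journals_to_reach` is hypothetical. The probabilities are the first seven values from the ranking.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-7, copied from the ranking.
probs = [18.7, 17.5, 9.1, 6.4, 4.8, 4.0, 4.0]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, running_total) for the first rank at which the
    cumulative probability reaches the threshold, or None if it never does.

    Assumption: the cutoff is the first rank whose running total is at
    least the threshold.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count, total
    return None


count, total = journals_to_reach(probs)
print(count, round(total, 1))  # prints: 4 51.7
```

Under this assumption, the first three journals sum to 45.3%, and adding the fourth brings the running total to 51.7%, which matches the statement that the top 4 journals account for 50% of the predicted probability mass.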