Machine Learning Method for Optimizing Coding Sequences in Mammalian Cells
Theodorou, E.; Stadler, M.; Gustafsson, C.; Welch, M.
Mammalian cell lines are the preferred hosts for producing commercially relevant therapeutic proteins such as antibodies, multispecifics, and cytokine fusion proteins. Although significant investment goes into optimizing upstream and downstream processes, the optimal gene design parameters for heterologous recombinant protein expression remain poorly understood. We describe here a generic approach to gene optimization in which design features are systematically sampled and modulated iteratively using machine learning (ML). Synthetic genes encoding the Dasher fluorescent protein, differing only in synonymous codons, were used to interrogate the gene-sequence preferences of transient antibody-expressing HEK293 cells. Synonymous codon variations influenced expression by more than two orders of magnitude. This variation in protein yield was used to build ML models relating gene design features to expression, and the models were then employed to design further-improved genes. The ML models were shown to be expression system-specific. Messenger RNA levels and ribosome occupancy were highly correlated with protein levels, suggesting that mRNA lifetime has a causal relationship with coding bias. Our results illustrate a novel, generally applicable method to improve gene expression via synonymous re-coding for any protein target or host cell.
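To make the idea of a "gene design feature" concrete, the sketch below computes one plausible feature set: the relative frequency of each codon among the synonymous codons used for the same amino acid. This is an illustration only. The abstract does not state which features the authors modeled, and the function names (`translate`, `codon_features`) and the two short example sequences are hypothetical. The codon table is the standard genetic code.

```python
from collections import Counter
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split a coding sequence into codons; the length must be a multiple of 3."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a coding sequence into a one-letter amino acid string."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def codon_features(cds):
    """Return, for each codon present, its share among the codons
    that encode the same amino acid in this sequence."""
    codons = split_codons(cds)
    codon_counts = Counter(codons)
    aa_counts = Counter(CODON_TABLE[codon] for codon in codons)
    return {
        codon: count / aa_counts[CODON_TABLE[codon]]
        for codon, count in codon_counts.items()
    }


# Two hypothetical variants that differ only in synonymous codons.
variant_a = "ATGCTGAAACTG"
variant_b = "ATGTTAAAGCTG"

print(translate(variant_a), translate(variant_b))  # same protein: MLKL MLKL
print(codon_features(variant_a))
print(codon_features(variant_b))
```

Feature vectors of this kind, computed for each synthetic variant and paired with its measured protein yield, could serve as input to a regression model. Which model and which features actually apply here is for the full paper to specify.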
Matching journals
The top 4 journals account for 50% of the predicted probability mass.