Protein structure-informed deep learning enables species-specific codon optimization

Jin, W.; Tan, W.; Li, H.; Ji, X.; Li, M.; Zhang, D.; Xu, J.

2026-04-24 molecular biology
10.64898/2026.04.21.720047 bioRxiv

Codon usage bias is highly species-specific, posing a major challenge for heterologous protein expression. Existing deep learning approaches to codon optimization rely primarily on DNA or protein sequence information and largely neglect constraints imposed by protein structure and folding. Here, we present Protein structure-Informed Species-specific Codon Optimization (PISCO), a Geometric Vector Perceptron (GVP)-based model that integrates protein sequence, three-dimensional protein structure, and host codon usage statistics to generate optimal, host-specific codon sequences. Compared with protein-structure-agnostic models, PISCO improves codon recovery by 6% and substantially increases similarity to natural coding sequences, reducing divergence by at least 42% in Codon Similarity Index (CSI), 50% in Codon Frequency Distribution (CFD), and 14% in Dynamic Time Warping (DTW) metrics. Ablation analyses demonstrate that incorporating protein folding kinetics and host-specific information is critical to these gains. Moreover, by leveraging host codon usage statistics, PISCO generalizes to optimize codon sequences for species absent from the training data. An autoregressive variant of PISCO further enhances concordance with natural codon usage patterns, at the cost of a modest reduction in codon recovery rate. Wet-lab validation confirms that PISCO-optimized sequences significantly enhance protein solubility and functional expression. Together, these results establish protein structure as a key determinant of species-specific codon optimization and provide a transferable framework for structure-aware gene design.
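As a rough illustration of two of the evaluation ideas named in the abstract, the sketch below computes a codon recovery rate (the fraction of positions where a designed codon matches the native one) and a textbook Dynamic Time Warping distance between two numeric profiles. The abstract does not define these metrics, so the function names, the absolute-difference cost, and the use of per-codon host frequencies as the DTW profile are assumptions made for this example, not PISCO's implementation. The frequency table is toy data.

```python
from typing import Dict, List, Sequence


def split_codons(cds: str) -> List[str]:
    """Split a coding sequence into upper-case codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3].upper() for i in range(0, len(cds), 3)]


def codon_recovery(predicted: str, native: str) -> float:
    """Fraction of codon positions where predicted and native codons agree."""
    pred, nat = split_codons(predicted), split_codons(native)
    if len(pred) != len(nat):
        raise ValueError("sequences must contain the same number of codons")
    if not nat:
        raise ValueError("sequences must not be empty")
    return sum(p == n for p, n in zip(pred, nat)) / len(nat)


def usage_profile(cds: str, host_freq: Dict[str, float]) -> List[float]:
    """Map each codon to its host usage frequency (0.0 if missing).

    Hypothetical helper: using host frequencies as the DTW profile is an
    assumption of this sketch.
    """
    return [host_freq.get(codon, 0.0) for codon in split_codons(cds)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping with an absolute-difference step cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    toy_host_freq = {"ATG": 1.0, "GCT": 0.2, "GCC": 0.3, "AAA": 0.7, "GGT": 0.4}
    native = "ATGGCTAAAGGT"
    designed = "ATGGCCAAAGGT"
    print("codon recovery:", codon_recovery(designed, native))
    print(
        "DTW distance:",
        dtw_distance(
            usage_profile(native, toy_host_freq),
            usage_profile(designed, toy_host_freq),
        ),
    )
```

Both sequences encode the same four amino acids and differ at a single synonymous alanine codon, so three of the four codon positions are recovered.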

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set, Top 6%, 18.5%
2. Nature Machine Intelligence: 61 papers in training set, Top 0.1%, 17.4%
3. Cell Systems: 167 papers in training set, Top 0.8%, 12.4%
4. Nucleic Acids Research: 1128 papers in training set, Top 2%, 7.1%

50% of probability mass above

5. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 14%, 4.8%
6. Nature Biotechnology: 147 papers in training set, Top 2%, 4.6%
7. Genome Biology: 555 papers in training set, Top 3%, 2.9%
8. Advanced Science: 249 papers in training set, Top 8%, 2.4%
9. Science: 429 papers in training set, Top 14%, 1.7%
10. Nature Methods: 336 papers in training set, Top 4%, 1.7%
11. PLOS Computational Biology: 1633 papers in training set, Top 17%, 1.7%
12. Scientific Reports: 3102 papers in training set, Top 62%, 1.5%
13. Communications Biology: 886 papers in training set, Top 14%, 1.2%
14. Journal of Structural Biology: 58 papers in training set, Top 1%, 1.1%
15. Cell Genomics: 162 papers in training set, Top 6%, 0.9%
16. Cell Reports: 1338 papers in training set, Top 32%, 0.8%
17. eLife: 5422 papers in training set, Top 58%, 0.7%
18. PLOS ONE: 4510 papers in training set, Top 68%, 0.7%
19. Cell Reports Medicine: 140 papers in training set, Top 8%, 0.7%
20. Genome Medicine: 154 papers in training set, Top 8%, 0.7%
21. ACS Synthetic Biology: 256 papers in training set, Top 3%, 0.7%
22. Bioinformatics: 1061 papers in training set, Top 10%, 0.7%
23. Cell Reports Methods: 141 papers in training set, Top 6%, 0.6%
24. Protein Science: 221 papers in training set, Top 2%, 0.6%
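
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is only a sanity check on the numbers shown on this page; the function name and the "first rank at or above the threshold" reading of the cutoff are assumptions.

```python
from itertools import accumulate
from typing import List

# Predicted probabilities (percent) for the first six listed journals.
listed_probs = [18.5, 17.4, 12.4, 7.1, 4.8, 4.6]


def journals_to_reach(probs_percent: List[float], threshold: float = 50.0) -> int:
    """Return the smallest rank whose cumulative probability reaches the threshold."""
    for rank, total in enumerate(accumulate(probs_percent), start=1):
        if total >= threshold:
            return rank
    raise ValueError("threshold is not reached by the given probabilities")


if __name__ == "__main__":
    print(journals_to_reach(listed_probs))
```

The first three journals sum to 48.3%, just short of the cutoff, and adding the fourth brings the total to 55.4%, which is why the 50% marker falls after rank 4.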