Ensemble-conditioned protein sequence design with Caliby

Shuai, R. W.; Lu, T.; Bhatti, S.; Kouba, P.; Huang, P.

2025-10-02 bioengineering
10.1101/2025.09.30.679633 bioRxiv

Structure-conditioned sequence design models aim to design a protein sequence that will fold into a given target structure. Deep-learning-based approaches for sequence design have proven highly successful for various protein design applications, but many non-idealized backbones still remain out of reach for current models under typical in silico success criteria. We hypothesize that training objectives prioritizing native sequence recovery unintentionally push models to reproduce non-structural signals (e.g. phylogenetic relatedness, neutral drift, or dataset sampling biases), rather than a broadly generalizable structure-sequence mapping. Inspired by recent work bridging sequence likelihood and fitness prediction in protein language models, we introduce Caliby, a Potts model-based sequence design method capable of conditioning on an ensemble of structures. Conditioning on a synthetic ensemble generated from an input backbone allows sampling of sequences consistent with the structural constraints of the ensemble while averaging out undesired biases towards the native sequence. Ensemble-conditioned sequence design with Caliby reduces native sequence recovery while substantially improving AlphaFold2 self-consistency, outperforming state-of-the-art models ProteinMPNN and ChromaDesign on both native and de novo backbones. Finally, we train a variant of Caliby on only soluble proteins and demonstrate in silico that Protpardelle-1c binder designs that were previously deemed undesignable by SolubleMPNN are actually designable under SolubleCaliby, highlighting limitations of existing filtering pipelines. These results suggest that Caliby can expand the de novo design space beyond highly idealized backbones.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | Cell Systems | 167 papers in training set | Top 0.2% | 23.4%
2 | Nature Methods | 336 papers in training set | Top 1% | 7.4%
3 | Nature Communications | 4913 papers in training set | Top 24% | 7.4%
4 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 10% | 6.6%
5 | Protein Engineering, Design and Selection | 14 papers in training set | Top 0.1% | 5.0%
6 | PLOS Computational Biology | 1633 papers in training set | Top 8% | 4.5%
50% of probability mass above
7 | Bioinformatics | 1061 papers in training set | Top 5% | 3.7%
8 | Nature Biotechnology | 147 papers in training set | Top 2% | 3.7%
9 | Science | 429 papers in training set | Top 8% | 3.7%
10 | Nature Machine Intelligence | 61 papers in training set | Top 0.8% | 3.7%
11 | Protein Science | 221 papers in training set | Top 0.4% | 3.2%
12 | Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 0.4% | 2.0%
13 | Structure | 175 papers in training set | Top 1% | 2.0%
14 | Scientific Reports | 3102 papers in training set | Top 52% | 2.0%
15 | Journal of Chemical Information and Modeling | 207 papers in training set | Top 2% | 1.4%
16 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.3%
17 | ACS Synthetic Biology | 256 papers in training set | Top 2% | 0.9%
18 | Nature Medicine | 117 papers in training set | Top 4% | 0.9%
19 | Cell Reports Methods | 141 papers in training set | Top 4% | 0.9%
20 | Advanced Science | 249 papers in training set | Top 17% | 0.8%
21 | Nature | 575 papers in training set | Top 15% | 0.8%
22 | Chemical Science | 71 papers in training set | Top 2% | 0.8%
23 | Communications Biology | 886 papers in training set | Top 24% | 0.7%
24 | Nature Computational Science | 50 papers in training set | Top 2% | 0.7%
25 | PLOS ONE | 4510 papers in training set | Top 71% | 0.7%
26 | Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
27 | Nucleic Acids Research | 1128 papers in training set | Top 20% | 0.5%