
Accurate protein stability prediction for small domains using mega-scale experiments

Cho, Y.; Tsuboyama, K.; Litberg, T. J.; Jung, M. D.; Obisesan, A.; Wang, Q.; Phoumyvong, C. M.; Thibeault, J.; Ovchinnikov, S.; Rocklin, G. J.

2026-05-20 biophysics
10.64898/2026.05.19.726285 bioRxiv

Predicting absolute protein folding stability is a long-standing challenge in biophysics, with broad applications in protein design and in understanding genetic variation and evolution. Physics-based simulations have shown limited success at predicting stability and are often computationally intractable, and machine learning methods have been constrained by the lack of sufficiently large experimental datasets. We recently introduced cDNA display proteolysis, a cell-free approach that can measure folding stability for nearly one million protein domains in parallel. Here, we applied this method to measure stability for 1.8 million diverse protein domains 60-80 amino acids in length primarily taken from the MGnify metagenomic database and spanning over 200,000 sequence families. Using this new "MGnify Stability dataset", we developed the predictive models SaProtΔG and ESM3ΔG, which accurately predict absolute folding stability for small domains with root mean squared error of 0.8 kcal/mol over a 6 kcal/mol range (Spearman rank correlation of 0.88). These predictors show high accuracy at predicting effects of substitutions, insertions, and deletions, successfully identify global trends toward higher stability in thermophilic organisms, and improve discrimination of stable and unstable computationally designed proteins. Our results illustrate how megascale biophysical measurements can complement existing evolutionary and structural data to enable accurate absolute stability prediction for small domains.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Protein Science | 221 papers in training set | Top 0.1% | 14.0%
2. Structure | 175 papers in training set | Top 0.1% | 12.2%
3. Nature Communications | 4913 papers in training set | Top 20% | 9.9%
4. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 8% | 8.2%
5. Biophysical Journal | 545 papers in training set | Top 1.0% | 6.2%
50% of probability mass above
6. Cell Systems | 167 papers in training set | Top 4% | 3.6%
7. eLife | 5422 papers in training set | Top 27% | 3.5%
8. Journal of Molecular Biology | 217 papers in training set | Top 0.7% | 3.5%
9. PLOS Computational Biology | 1633 papers in training set | Top 11% | 3.0%
10. Nucleic Acids Research | 1128 papers in training set | Top 7% | 3.0%
11. Science | 429 papers in training set | Top 11% | 2.5%
12. Journal of Chemical Information and Modeling | 207 papers in training set | Top 2% | 2.0%
13. Nature Methods | 336 papers in training set | Top 4% | 1.8%
14. PLOS ONE | 4510 papers in training set | Top 55% | 1.7%
15. The Journal of Physical Chemistry B | 158 papers in training set | Top 1% | 1.6%
16. Journal of Chemical Theory and Computation | 126 papers in training set | Top 0.6% | 1.5%
17. Bioinformatics | 1061 papers in training set | Top 8% | 1.3%
18. Communications Biology | 886 papers in training set | Top 15% | 1.2%
19. Scientific Reports | 3102 papers in training set | Top 67% | 1.2%
20. Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 0.7% | 1.1%
21. The Journal of Physical Chemistry Letters | 58 papers in training set | Top 1% | 0.9%
22. IUCrJ | 29 papers in training set | Top 0.3% | 0.9%
23. Frontiers in Molecular Biosciences | 100 papers in training set | Top 4% | 0.9%
24. Biochemistry | 130 papers in training set | Top 2% | 0.7%
25. Journal of the American Chemical Society | 199 papers in training set | Top 5% | 0.7%
26. Bioinformatics Advances | 184 papers in training set | Top 5% | 0.7%
27. Nature Computational Science | 50 papers in training set | Top 2% | 0.6%
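
The 50% cutoff in the ranking can be checked by summing the listed probabilities in rank order. The sketch below is only an illustration: the values are copied from the first ten entries of the list, and the helper name `journals_to_reach` is hypothetical, not part of the site that produced the ranking.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-10, copied from the list.
probs = [14.0, 12.2, 9.9, 8.2, 6.2, 3.6, 3.5, 3.5, 3.0, 3.0]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


top5_mass = sum(probs[:5])                  # cumulative mass of ranks 1-5
n_needed = journals_to_reach(probs, 50.0)   # entries needed to reach 50%
print(f"top-5 mass: {top5_mass:.1f}%, journals needed for 50%: {n_needed}")
```

With these values the running total first reaches 50% at rank 5, which matches the position of the "50% of probability mass above" marker.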