Accurate protein stability prediction for small domains using mega-scale experiments
Cho, Y.; Tsuboyama, K.; Litberg, T. J.; Jung, M. D.; Obisesan, A.; Wang, Q.; Phoumyvong, C. M.; Thibeault, J.; Ovchinnikov, S.; Rocklin, G. J.
Predicting absolute protein folding stability is a long-standing challenge in biophysics, with broad applications in protein design and in understanding genetic variation and evolution. Physics-based simulations have shown limited success at predicting stability and are often computationally intractable, and machine learning methods have been constrained by the lack of sufficiently large experimental datasets. We recently introduced cDNA display proteolysis, a cell-free approach that can measure folding stability for nearly one million protein domains in parallel. Here, we applied this method to measure stability for 1.8 million diverse protein domains 60-80 amino acids in length primarily taken from the MGnify metagenomic database and spanning over 200,000 sequence families. Using this new "MGnify Stability dataset", we developed the predictive models SaProtΔG and ESM3ΔG, which accurately predict absolute folding stability for small domains with root mean squared error of 0.8 kcal/mol over a 6 kcal/mol range (Spearman rank correlation of 0.88). These predictors show high accuracy at predicting effects of substitutions, insertions, and deletions, successfully identify global trends toward higher stability in thermophilic organisms, and improve discrimination of stable and unstable computationally designed proteins. Our results illustrate how megascale biophysical measurements can complement existing evolutionary and structural data to enable accurate absolute stability prediction for small domains.
Matching journals
The top 5 journals account for 50% of the predicted probability mass.