Towards Compilation of Balanced Protein Stability Datasets: Flattening the ΔΔG Curve through Systematic Under-sampling
Kebabci, N.; Timucin, A. C.; Timucin, E.
Protein stability datasets contain neutral mutations that are highly concentrated in a much narrower ΔΔG range than destabilizing and stabilizing mutations. Notwithstanding their high density, studies analyzing stability datasets and/or predictors often ignore the neutral mutations and use a binary classification scheme that labels only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations would affect the quality of stability datasets, we explored three protein stability datasets that differ in size and quality: S2648, PON-tstab and the symmetric Ssym. A characteristic leptokurtic shape, attributable to the concentrated neutral mutations, was observed in the ΔΔG distributions of all three datasets, including the curated and symmetric ones. To further investigate the impact of neutral mutations on ΔΔG predictions, we comprehensively assessed the performance of eleven predictors on the PON-tstab dataset. Correlation and error analyses showed that all of the predictors performed best on the neutral mutations, while their performance gradually worsened as the ΔΔG of the mutations departed further from the neutral zone, regardless of direction, implying a bias towards the densely sampled mutations. Having established the role of concentrated neutral mutations in the biases of stability datasets, we described a systematic under-sampling approach to balance the ΔΔG distributions. Before under-sampling, mutations were clustered based on their biochemical and/or structural features, and then three mutations were systematically selected from every 2 kcal/mol interval of each cluster. By implementing this approach with distinct clustering schemes, we generated five subsets varying in size and ΔΔG distribution.
All subsets showed a notable improvement not only in the shape of the ΔΔG distributions but also in other pre-existing imbalances in the frequency distributions. We also reported differences in the performance of the predictors between the parent dataset and the under-sampled subsets, attributable to the enrichment of previously under-represented mutations in the subsets. Altogether, this study not only elaborated the pivotal role of concentrated mutations in dataset biases but also proposed and implemented a rational strategy to tackle this and other forms of bias. The under-sampling code is available on GitHub (https://github.com/narodkebabci/gRoR).
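
The under-sampling step described in the abstract (cluster the mutations, then keep three per 2 kcal/mol interval within each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is in the repository linked above. The function name `undersample`, the record keys `cluster` and `ddg`, the bin edges at multiples of the bin width, and the evenly spaced selection rule are all assumptions made for illustration; the abstract does not specify how the three mutations are chosen within an interval.

```python
import math
from collections import defaultdict


def undersample(mutations, bin_width=2.0, per_bin=3):
    """Keep at most `per_bin` mutations from each ddG bin of each cluster.

    `mutations` is a list of dicts with hypothetical keys:
      "cluster": a precomputed cluster label (clustering is not done here)
      "ddg":     the ddG value in kcal/mol
    """
    # Group mutations by (cluster label, ddG bin index).
    # Assumption: bins are aligned to multiples of bin_width, e.g. [0, 2), [2, 4).
    groups = defaultdict(list)
    for m in mutations:
        bin_index = math.floor(m["ddg"] / bin_width)
        groups[(m["cluster"], bin_index)].append(m)

    selected = []
    for key in sorted(groups):
        members = sorted(groups[key], key=lambda m: m["ddg"])
        n = len(members)
        if n <= per_bin:
            # Sparse bins are kept whole.
            selected.extend(members)
        elif per_bin == 1:
            selected.append(members[n // 2])
        else:
            # Assumption: pick evenly spaced ranks across the sorted bin,
            # always including the lowest and highest ddG in that bin.
            picks = [(i * (n - 1)) // (per_bin - 1) for i in range(per_bin)]
            selected.extend(members[i] for i in picks)
    return selected


# Toy data: a dense near-neutral bin plus a few sparse extreme values.
toy = (
    [{"cluster": "A", "ddg": 0.1 * i} for i in range(10)]
    + [{"cluster": "A", "ddg": -3.5}, {"cluster": "A", "ddg": 4.2}]
    + [{"cluster": "B", "ddg": 0.5}]
)
subset = undersample(toy)
```

In this toy input, the ten near-neutral mutations of cluster A are reduced to three, while the sparse destabilizing and stabilizing entries are all retained, which is the flattening effect the abstract describes. A seeded random draw within each bin would be an equally plausible reading of "systematically selected".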