Towards Compilation of Balanced Protein Stability Datasets: Flattening the ΔΔG Curve through Systematic Under-sampling

Kebabci, N.; Timucin, A. C.; Timucin, E.

2021-09-20 bioinformatics
10.1101/2021.09.17.460216 bioRxiv
Protein stability datasets contain neutral mutations that are concentrated in a much narrower ΔΔG range than destabilizing and stabilizing mutations. Despite this high density, studies that analyze stability datasets and/or predictors often ignore neutral mutations and use a binary classification scheme that labels only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations would affect the quality of stability datasets, we explored three protein stability datasets that differ in size and quality: S2648, PON-tstab and the symmetric Ssym. The ΔΔG distributions of all three datasets, including the curated and symmetric ones, showed a characteristic leptokurtic shape due to the concentrated neutral mutations. To further investigate the impact of neutral mutations on ΔΔG predictions, we comprehensively assessed the performance of eleven predictors on the PON-tstab dataset. Correlation and error analyses showed that all of the predictors performed best on the neutral mutations, and that their performance became gradually worse as the ΔΔG of the mutations departed further from the neutral zone in either direction, implying a bias towards densely sampled mutations. Having established the role of concentrated neutral mutations in the biases of stability datasets, we described a systematic under-sampling approach to balance the ΔΔG distributions. Before under-sampling, mutations were clustered based on their biochemical and/or structural features, and then three mutations were systematically selected from every 2 kcal/mol of each cluster. By implementing this approach with distinct clustering schemes, we generated five subsets varying in size and ΔΔG distribution. All subsets showed notable improvement not only in the shape of the ΔΔG distributions but also in other pre-existing imbalances in the frequency distributions. We also reported differences in the performance of the predictors between the parent dataset and the under-sampled subsets due to the enrichment of previously under-represented mutations in the subsets. Altogether, this study not only elaborated the pivotal role of concentrated mutations in dataset biases but also devised and implemented a rational strategy to tackle this and other forms of bias. Under-sampling code is available on GitHub (https://github.com/narodkebabci/gRoR).
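
The under-sampling step described in the abstract (three mutations from every 2 kcal/mol of each cluster) can be illustrated with a minimal Python sketch. This is not the authors' gRoR code. It assumes that cluster labels have already been assigned, that bins are half-open intervals aligned at 0 kcal/mol, and that the three mutations per bin are drawn at random with a fixed seed; the abstract does not specify any of these details. The function name `undersample`, the record fields (`id`, `cluster`, `ddg`) and the toy data are hypothetical.

```python
import math
import random
from collections import defaultdict


def undersample(mutations, bin_width=2.0, per_bin=3, seed=0):
    """Keep at most `per_bin` mutations from each ddG bin of each cluster.

    Assumptions (not taken from the paper): bins are [k*w, (k+1)*w) and
    the kept mutations are picked at random with a fixed seed.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for m in mutations:
        # floor() maps [-2, 0) to bin -1, [0, 2) to bin 0, [2, 4) to bin 1, ...
        bin_idx = math.floor(m["ddg"] / bin_width)
        groups[(m["cluster"], bin_idx)].append(m)

    selected = []
    for key in sorted(groups):  # deterministic iteration order
        members = groups[key]
        k = min(per_bin, len(members))  # sparse bins are kept whole
        selected.extend(rng.sample(members, k))
    return selected


# Toy data: cluster A is dominated by near-neutral mutations.
toy = (
    [{"id": f"A{i}", "cluster": "A", "ddg": 0.1 * i} for i in range(10)]
    + [{"id": "A10", "cluster": "A", "ddg": 3.5},
       {"id": "A11", "cluster": "A", "ddg": 3.9},
       {"id": "A12", "cluster": "A", "ddg": -2.5}]
    + [{"id": f"B{i}", "cluster": "B", "ddg": 0.2 * i} for i in range(5)]
)

subset = undersample(toy)
print(len(toy), "->", len(subset))
```

In this toy example the ten near-neutral mutations of cluster A are reduced to three, while the sparsely populated bins far from 0 kcal/mol are retained in full, which flattens the ΔΔG distribution within each cluster.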

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Journal of Molecular Biology · 217 papers in training set · Top 0.1% · 18.0%
2. Nature Communications · 4913 papers in training set · Top 12% · 13.9%
3. Protein Science · 221 papers in training set · Top 0.1% · 9.8%
4. Journal of Chemical Theory and Computation · 126 papers in training set · Top 0.2% · 6.1%
5. PLOS Computational Biology · 1633 papers in training set · Top 7% · 4.7%

50% of probability mass above

6. Briefings in Bioinformatics · 326 papers in training set · Top 2% · 4.0%
7. Bioinformatics · 1061 papers in training set · Top 5% · 3.5%
8. Journal of Chemical Information and Modeling · 207 papers in training set · Top 1% · 3.5%
9. Computational and Structural Biotechnology Journal · 216 papers in training set · Top 3% · 2.5%
10. Nucleic Acids Research · 1128 papers in training set · Top 9% · 2.0%
11. Journal of Cheminformatics · 25 papers in training set · Top 0.2% · 2.0%
12. Proteins: Structure, Function, and Bioinformatics · 82 papers in training set · Top 0.4% · 1.8%
13. Communications Chemistry · 39 papers in training set · Top 0.3% · 1.6%
14. eLife · 5422 papers in training set · Top 46% · 1.4%
15. International Journal of Molecular Sciences · 453 papers in training set · Top 10% · 1.3%
16. Scientific Reports · 3102 papers in training set · Top 67% · 1.2%
17. PLOS ONE · 4510 papers in training set · Top 63% · 0.9%
18. Cell Systems · 167 papers in training set · Top 10% · 0.9%
19. Molecular Systems Biology · 142 papers in training set · Top 1% · 0.9%
20. Molecular Biology and Evolution · 488 papers in training set · Top 4% · 0.9%
21. Chemical Science · 71 papers in training set · Top 2% · 0.9%
22. NAR Genomics and Bioinformatics · 214 papers in training set · Top 3% · 0.9%
23. Frontiers in Molecular Biosciences · 100 papers in training set · Top 5% · 0.8%
24. The Journal of Physical Chemistry B · 158 papers in training set · Top 2% · 0.8%
25. Biomolecules · 95 papers in training set · Top 2% · 0.7%
26. PeerJ · 261 papers in training set · Top 16% · 0.7%
27. Bioinformatics Advances · 184 papers in training set · Top 5% · 0.7%
28. Nature Methods · 336 papers in training set · Top 6% · 0.7%
29. Proceedings of the National Academy of Sciences · 2130 papers in training set · Top 46% · 0.7%
30. Structure · 175 papers in training set · Top 4% · 0.6%