
Taxonomy-aware, disorder-matched benchmarking of phase-separating protein predictors

Hou, S.; Shen, H.; Zhang, Y.

2026-02-12 bioinformatics
10.64898/2026.02.11.705241 bioRxiv

Background: Biomolecular condensates formed via liquid-liquid phase separation (LLPS) play vital roles in cellular organization and function. Computational prediction of phase-separating proteins (PSPs) is increasingly used to prioritize candidates at proteome scale, making robust, well-designed benchmarks essential for fair evaluation and iterative improvement of PSP predictors.

Results: We first show that a recently released PSP benchmark is substantially confounded by imbalances in taxonomic origin and intrinsic-disorder composition between its positive and negative sets. These imbalances allow predictors to achieve high apparent performance by exploiting non-LLPS shortcuts, obscuring their true ability to distinguish PSPs. To minimize these effects, we construct a taxonomy-aware, disorder-matched PSP benchmark. Using this benchmark, we find that absolute sequence and biophysical feature values of PSPs differ markedly across taxa, whereas LLPS-associated feature shifts relative to taxon-specific proteome backgrounds are comparatively conserved. Benchmarking twenty PSP predictors under this framework reveals pronounced taxon-dependent variation in performance. Moreover, PSPs lacking intrinsically disordered regions (IDRs) consistently constitute a more challenging regime across methods, motivating routine disorder-stratified evaluation.

Conclusions: Our taxonomy-aware, disorder-matched benchmarking framework reduces shortcut-driven biases, enables more interpretable evaluation of PSP predictors, and provides guidance for developing models that capture transferable LLPS-associated signals rather than dataset- or taxon-specific shortcuts.
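The abstract does not describe how the matching or the stratified evaluation is carried out, so the following is only an illustrative Python sketch of the two ideas, not the authors' procedure. It assumes greedy nearest-neighbour matching of negatives to positives within the same taxon, and AUROC as the per-stratum metric. All names here (`match_negatives`, `stratified_auroc`, the `caliper` parameter, and the record keys `id`, `taxon`, `disorder`, `has_idr`, `label`, `score`) are hypothetical.

```python
from collections import defaultdict


def match_negatives(positives, negatives, caliper=0.05):
    """Greedily pair each positive with an unused negative from the same taxon.

    Records are dicts with hypothetical keys 'id', 'taxon' and 'disorder'
    (fraction of disordered residues, 0..1). A pair is kept only if the
    disorder fractions differ by at most `caliper` (an assumed tolerance).
    """
    unused = defaultdict(list)
    for neg in negatives:
        unused[neg["taxon"]].append(neg)

    pairs = []
    for pos in positives:
        candidates = unused[pos["taxon"]]
        if not candidates:
            continue  # no negatives left for this taxon
        best = min(candidates, key=lambda n: abs(n["disorder"] - pos["disorder"]))
        if abs(best["disorder"] - pos["disorder"]) <= caliper:
            candidates.remove(best)  # sample without replacement
            pairs.append((pos["id"], best["id"]))
    return pairs


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auroc(records):
    """Compute AUROC separately for each (taxon, has_idr) stratum."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["taxon"], rec["has_idr"])].append(rec)
    return {
        key: auroc([r["label"] for r in grp], [r["score"] for r in grp])
        for key, grp in groups.items()
    }
```

Reporting one score per stratum, rather than a single pooled score, is what makes a weak regime such as PSPs lacking IDRs visible instead of being averaged away.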

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 1%; 22.2%
2. Nature Communications: 4913 papers in training set; Top 23%; 8.3%
3. Journal of Proteome Research: 215 papers in training set; Top 0.5%; 6.2%
4. Protein Science: 221 papers in training set; Top 0.3%; 4.1%
5. Cell Systems: 167 papers in training set; Top 3%; 3.9%
6. PLOS Computational Biology: 1633 papers in training set; Top 9%; 3.9%
7. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 21%; 3.5%
50% of probability mass above
8. Molecular & Cellular Proteomics: 158 papers in training set; Top 0.7%; 3.5%
9. BMC Bioinformatics: 383 papers in training set; Top 3%; 3.5%
10. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 3.0%
11. NAR Genomics and Bioinformatics: 214 papers in training set; Top 1%; 2.6%
12. Advanced Science: 249 papers in training set; Top 8%; 2.3%
13. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 4%; 1.9%
14. Scientific Reports: 3102 papers in training set; Top 56%; 1.8%
15. Communications Biology: 886 papers in training set; Top 9%; 1.7%
16. Molecular Systems Biology: 142 papers in training set; Top 0.6%; 1.7%
17. PLOS ONE: 4510 papers in training set; Top 57%; 1.5%
18. Genome Biology: 555 papers in training set; Top 5%; 1.5%
19. Nature Methods: 336 papers in training set; Top 5%; 1.3%
20. Cell Reports Methods: 141 papers in training set; Top 3%; 1.2%
21. Nature Machine Intelligence: 61 papers in training set; Top 3%; 1.1%
22. iScience: 1063 papers in training set; Top 25%; 0.9%
23. Bioinformatics Advances: 184 papers in training set; Top 4%; 0.9%
24. Patterns: 70 papers in training set; Top 2%; 0.8%
25. Nucleic Acids Research: 1128 papers in training set; Top 17%; 0.8%
26. Nature Biotechnology: 147 papers in training set; Top 7%; 0.7%
27. Analytical Chemistry: 205 papers in training set; Top 3%; 0.7%
28. PROTEOMICS: 35 papers in training set; Top 0.9%; 0.7%
29. npj Systems Biology and Applications: 99 papers in training set; Top 3%; 0.6%
30. Journal of Molecular Biology: 217 papers in training set; Top 4%; 0.6%
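
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The page does not say how the cutoff is computed, so this minimal Python sketch assumes it is the smallest rank at which the cumulative probability reaches the threshold. The function name `ranks_to_reach` is hypothetical, and only the first ten listed values are used.

```python
def ranks_to_reach(probabilities, threshold=50.0):
    """Return the smallest rank k whose cumulative probability >= threshold.

    `probabilities` are percentages sorted in descending order.
    Returns None if the threshold is never reached.
    """
    cumulative = 0.0
    for rank, prob in enumerate(probabilities, start=1):
        cumulative += prob
        if cumulative >= threshold:
            return rank
    return None


# Predicted probabilities (%) for ranks 1 to 10, copied from the list above.
top10 = [22.2, 8.3, 6.2, 4.1, 3.9, 3.9, 3.5, 3.5, 3.5, 3.0]
cutoff = ranks_to_reach(top10)
```

With these values, the cumulative sum is about 48.6% after six journals and about 52.1% after seven, which agrees with the stated cutoff.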