Spurious correlation inflates performance in single-cell perturbation prediction

Nicol, P. B.; Shivakumar, S.; Irizarry, R.

2026-05-12 bioinformatics
10.64898/2026.05.07.723486 bioRxiv

The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
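The mechanism described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' procedure: the Gaussian noise model, the function names `pearson` and `simulate`, the cell counts, and the even random split of controls are all hypothetical choices made for illustration. No perturbation has any true effect here, and the "prediction" is an independent draw of cells, so any correlation between predicted and observed differential expression can only come from the shared control mean.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def simulate(n_genes=2000, n_ctrl=50, n_pert=200, seed=0):
    """Toy model (assumption): per-gene Gaussian expression, zero true effect."""
    rng = np.random.default_rng(seed)
    baseline = rng.normal(0.0, 1.0, n_genes)

    # Control cells, observed perturbed cells, and a non-informative
    # "prediction" built from independent cells with no perturbation effect.
    ctrl = baseline + rng.normal(0.0, 1.0, (n_ctrl, n_genes))
    observed = baseline + rng.normal(0.0, 1.0, (n_pert, n_genes))
    predicted = baseline + rng.normal(0.0, 1.0, (n_pert, n_genes))
    obs_mean = observed.mean(axis=0)
    pred_mean = predicted.mean(axis=0)

    # Shared controls: both deltas subtract the same noisy control mean.
    ctrl_mean = ctrl.mean(axis=0)
    shared = pearson(obs_mean - ctrl_mean, pred_mean - ctrl_mean)

    # Split controls: each delta uses a disjoint half of the control cells.
    perm = rng.permutation(n_ctrl)
    half_a = ctrl[perm[: n_ctrl // 2]].mean(axis=0)
    half_b = ctrl[perm[n_ctrl // 2:]].mean(axis=0)
    split = pearson(obs_mean - half_a, pred_mean - half_b)
    return shared, split


if __name__ == "__main__":
    shared, split = simulate()
    print(f"shared controls: r = {shared:.2f}")
    print(f"split controls:  r = {split:.2f}")
```

Under these assumptions the shared-control correlation should sit near (1/n_ctrl) / (1/n_ctrl + 1/n_pert), roughly 0.8 with the defaults, while the split-control correlation should be close to zero. Lowering `n_ctrl` increases the inflation, which is consistent with the abstract's remark about datasets with few control cells.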

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology, 1633 papers in training set, Top 1.0%, 19.1%
2. Bioinformatics, 1061 papers in training set, Top 2%, 12.6%
3. BMC Bioinformatics, 383 papers in training set, Top 0.7%, 10.7%
4. Cell Systems, 167 papers in training set, Top 2%, 7.0%
5. Nature Communications, 4913 papers in training set, Top 39%, 3.7%

50% of probability mass above

6. Scientific Reports, 3102 papers in training set, Top 34%, 3.7%
7. Physical Biology, 43 papers in training set, Top 0.6%, 2.9%
8. PLOS ONE, 4510 papers in training set, Top 45%, 2.5%
9. npj Systems Biology and Applications, 99 papers in training set, Top 0.8%, 2.1%
10. Journal of Cell Science, 353 papers in training set, Top 0.9%, 1.9%
11. Genome Research, 409 papers in training set, Top 2%, 1.8%
12. Nucleic Acids Research, 1128 papers in training set, Top 10%, 1.7%
13. NAR Genomics and Bioinformatics, 214 papers in training set, Top 2%, 1.7%
14. Briefings in Bioinformatics, 326 papers in training set, Top 4%, 1.7%
15. Genetics, 225 papers in training set, Top 3%, 1.4%
16. Frontiers in Genetics, 197 papers in training set, Top 6%, 1.4%
17. Communications Biology, 886 papers in training set, Top 12%, 1.4%
18. Frontiers in Molecular Biosciences, 100 papers in training set, Top 3%, 1.3%
19. Biophysical Journal, 545 papers in training set, Top 4%, 0.9%
20. iScience, 1063 papers in training set, Top 25%, 0.9%
21. The Annals of Applied Statistics, 15 papers in training set, Top 0.1%, 0.8%
22. Physical Review E, 95 papers in training set, Top 1%, 0.8%
23. Bioinformatics Advances, 184 papers in training set, Top 5%, 0.7%
24. Computational and Structural Biotechnology Journal, 216 papers in training set, Top 10%, 0.7%
25. Journal of Bioinformatics and Systems Biology, 14 papers in training set, Top 0.8%, 0.7%
26. Journal of Molecular Biology, 217 papers in training set, Top 4%, 0.7%
27. Cancers, 200 papers in training set, Top 5%, 0.7%
28. G3 Genes|Genomes|Genetics, 351 papers in training set, Top 3%, 0.7%
29. Genome Biology, 555 papers in training set, Top 9%, 0.5%
30. Biostatistics, 21 papers in training set, Top 0.2%, 0.5%