Evaluating Single-Cell Perturbation Response Models Is Far from Straightforward
Heidari, M.; Karimpour, M.; Srivatsa, S.; Montazeri, H.
Predicting cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology and a key step toward building in silico virtual cells. The rapid growth of perturbation datasets and advances in deep learning models have raised expectations for accurate and generalizable prediction. We show that these expectations are overly optimistic, largely due to the failure modes of existing evaluation metrics. In this study, using cross-splitting, controlled-noise experiments, and synthetic data, we systematically evaluate both prediction models and evaluation metrics. We demonstrate that widely used metrics, including correlation-based measures and common distributional distances, are strongly influenced by scale, sparsity, and dimensionality, often misrepresenting model performance. In particular, the Wasserstein distance fails in high-dimensional gene expression spaces under variance scaling, while the Energy distance can overlook disruptions in gene-gene dependencies. Our analyses further reveal that complex deep learning models often underperform simple baselines and remain far from empirical performance bounds across multiple chemical perturbation datasets. Together, our framework exposes critical pitfalls, establishes robust evaluation guidelines, and provides a foundation for trustworthy benchmarking toward reliable virtual-cell models.
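The Energy-distance failure mode described above can be illustrated with a minimal NumPy sketch (not the authors' code; the toy data, sample sizes, and correlation strength are assumptions for illustration). Independently permuting each gene column of a correlated expression matrix preserves every marginal distribution while destroying gene-gene dependencies, yet the sample energy distance stays small relative to an easily detected mean shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_distance(x, y):
    """Sample (V-statistic) energy distance: 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||."""
    def mean_pdist(a, b):
        # mean pairwise Euclidean distance between rows of a and rows of b
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.sqrt(d2).mean()
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

# Toy "expression matrix": 300 cells x 40 genes with strong gene-gene correlation
# (all parameters here are illustrative assumptions, not from the paper).
n, d, rho = 300, 40, 0.9
cov = np.full((d, d), rho) + (1 - rho) * np.eye(d)
x = rng.multivariate_normal(np.zeros(d), cov, size=n)

# Destroy gene-gene dependencies by permuting each gene column independently;
# the per-gene marginal distributions are preserved exactly.
y = np.column_stack([rng.permutation(x[:, j]) for j in range(d)])

e_decorrelated = energy_distance(x, y)   # small, despite broken dependency structure
e_shifted = energy_distance(x, x + 1.0)  # a uniform mean shift is clearly detected
print(e_decorrelated, e_shifted)
```

Because the energy distance is driven by pairwise Euclidean distances, whose means depend mostly on marginal variances, the decorrelated sample scores far closer to the original than the mean-shifted one does, even though its joint distribution is qualitatively different.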