Strategies for addressing pseudoreplication in multi-patient scRNA-seq data

Malfait, M.; Gilis, J.; Van den Berge, K.; Assefa, A. T.; Verbist, B.; Clement, L.

2024-06-17 bioinformatics

10.1101/2024.06.15.599144 bioRxiv


The rapidly evolving field of single-cell transcriptomics has provided a powerful means for understanding cellular heterogeneity. Large-scale studies with multiple biological samples hold promise for discovering differentially expressed biomarkers with a higher level of confidence through a better characterization of the target population. However, the hierarchical nature of these experiments introduces a significant challenge for downstream statistical analysis. Indeed, despite the availability of numerous differential expression methods, only a select few can accurately address the within-patient correlation of single-cell expression profiles. Furthermore, the high computational costs associated with some of these methods limit their practical use. In this manuscript, we undertake a comprehensive assessment of different strategies to address the hierarchical correlation structure in multi-sample scRNA-seq data. We employ synthetic data generated from a simulator that retains the original correlation structure of multi-patient data while making minimal assumptions, providing a robust platform for benchmarking method performance. Our analyses indicate that neglecting within-patient correlation jeopardizes type I error control. We show that, in line with some previous reports and in contrast with others, Poisson Generalized Estimating Equations (GEEs) provide a useful and flexible framework for addressing these issues. We also show that pseudobulk approaches outperform single-cell level methods across the board. In this work, we resolve the conflicting results regarding the utility of GEEs and their performance relative to pseudobulk approaches. As such, we provide valuable guidelines for researchers navigating the complex landscape of gene expression modeling, and offer insights on choosing the most appropriate methods based on the specific structure and design of their datasets.
