Strategies for addressing pseudoreplication in multi-patient scRNA-seq data

Malfait, M.; Gilis, J.; Van den Berge, K.; Assefa, A. T.; Verbist, B.; Clement, L.

2024-06-17 bioinformatics
10.1101/2024.06.15.599144 bioRxiv
The rapidly evolving field of single-cell transcriptomics has provided a powerful means for understanding cellular heterogeneity. Large-scale studies with multiple biological samples hold promise for discovering differentially expressed biomarkers with a higher level of confidence through a better characterization of the target population. However, the hierarchical nature of these experiments introduces a significant challenge for downstream statistical analysis. Indeed, despite the availability of numerous differential expression methods, only a select few can accurately address the within-patient correlation of single-cell expression profiles. Furthermore, due to the high computational costs associated with some of these methods, their practical use is limited. In this manuscript, we undertake a comprehensive assessment of different strategies to address the hierarchical correlation structure in multi-sample scRNA-seq data. We employ synthetic data generated from a simulator that retains the original correlation structure of multi-patient data while making minimal assumptions, providing a robust platform for benchmarking method performance. Our analyses indicate that neglecting within-patient correlation jeopardizes type I error control. We show that, in line with some previous reports and in contrast with others, Poisson Generalized Estimating Equations (GEEs) provide a useful and flexible framework for addressing these issues. We also show that pseudobulk approaches outperform single-cell level methods across the board. In this work, we resolve the conflicting results regarding the utility of GEEs and their performance relative to pseudobulk approaches. As such, we provide valuable guidelines for researchers navigating the complex landscape of gene expression modeling, and offer insights on choosing the most appropriate methods based on the specific structure and design of their datasets.
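As a rough illustration of the pseudobulk idea mentioned in the abstract, the sketch below sums cell-level counts within each patient, so that downstream testing treats the patient rather than the cell as the unit of replication. It is not the authors' pipeline: the function name `pseudobulk`, the toy count matrix, and the patient labels are all hypothetical, and real analyses typically aggregate per patient and cell type before applying a bulk differential expression method.

```python
import numpy as np


def pseudobulk(counts, patient_ids):
    """Sum cell-level counts within each patient (hypothetical helper).

    counts      : (n_cells, n_genes) array of raw counts
    patient_ids : length n_cells sequence of patient labels
    Returns the sorted unique patient labels and an
    (n_patients, n_genes) matrix of summed counts.
    """
    counts = np.asarray(counts)
    patients, inverse = np.unique(patient_ids, return_inverse=True)
    aggregated = np.zeros((len(patients), counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add: each cell's row is accumulated into its patient's row.
    np.add.at(aggregated, inverse, counts)
    return patients, aggregated


# Toy data (made up): 4 cells, 2 genes, 2 patients.
toy_counts = np.array([[1, 0],
                       [2, 1],
                       [0, 3],
                       [4, 4]])
toy_patients = ["P1", "P1", "P2", "P2"]

patients, bulk = pseudobulk(toy_counts, toy_patients)
```

Here `bulk` has one row per patient, so correlated cells from the same patient no longer enter the analysis as independent observations.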

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     PLOS Computational Biology                          1633                    Top 2%      14.8%
2     Bioinformatics                                      1061                    Top 2%      14.4%
3     BMC Bioinformatics                                  383                     Top 1%      7.2%
4     PLOS ONE                                            4510                    Top 25%     6.8%
5     Biostatistics                                       21                      Top 0.1%    4.3%
6     Nucleic Acids Research                              1128                    Top 6%      3.6%
50% of probability mass above
7     Computational and Structural Biotechnology Journal  216                     Top 2%      3.1%
8     The Annals of Applied Statistics                    15                      Top 0.1%    2.5%
9     Frontiers in Genetics                               197                     Top 3%      2.1%
10    PeerJ                                               261                     Top 5%      2.1%
11    Physical Biology                                    43                      Top 0.8%    2.1%
12    Statistics in Medicine                              34                      Top 0.1%    2.1%
13    Briefings in Bioinformatics                         326                     Top 3%      2.1%
14    Genome Research                                     409                     Top 2%      1.8%
15    NAR Genomics and Bioinformatics                     214                     Top 2%      1.8%
16    Biometrics                                          22                      Top 0.1%    1.7%
17    Scientific Reports                                  3102                    Top 59%     1.7%
18    iScience                                            1063                    Top 16%     1.7%
19    BMC Genomics                                        328                     Top 4%      1.2%
20    Genome Biology                                      555                     Top 5%      1.2%
21    npj Systems Biology and Applications                99                      Top 2%      0.9%
22    PLOS Genetics                                       756                     Top 14%     0.8%
23    Journal of Bioinformatics and Systems Biology       14                      Top 0.7%    0.7%
24    Bioinformatics Advances                             184                     Top 5%      0.7%
25    Journal of Computational Biology                    37                      Top 0.7%    0.6%
26    Cell Systems                                        167                     Top 14%     0.6%
27    NeuroImage                                          813                     Top 6%      0.6%
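
The statement that the top 6 journals cover 50% of the predicted probability mass can be checked from the listed percentages. The sketch below accumulates the first eight probabilities from the list and finds the smallest rank at which the running total reaches 50%. The variable names are illustrative, and the listed values are rounded, so the totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1 to 8, copied from the list above.
probabilities = [14.8, 14.4, 7.2, 6.8, 4.3, 3.6, 3.1, 2.5]

# Running total of probability mass by rank.
running = list(accumulate(probabilities))

# Smallest rank whose cumulative mass is at least 50%.
top_k = next(rank for rank, total in enumerate(running, start=1) if total >= 50.0)

print(top_k)                 # 6
print(round(running[4], 1))  # 47.5 (top 5 fall short of 50%)
print(round(running[5], 1))  # 51.1 (top 6 cross 50%)
```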