Cell type composition drives patient stratification in single-cell RNA-seq cohorts
Halter, C.; Andreatta, M.; Carmona, S.
Early transcriptomic studies demonstrated that unsupervised analysis of bulk gene expression can reveal clinically meaningful patient subgroups. Single-cell RNA sequencing (scRNA-seq) provides high-resolution characterization of cellular heterogeneity and therefore enables more refined patient stratification. Several computational approaches have been proposed to summarize single-cell data into sample-level representations for cohort-level exploratory analyses. However, these methods generally do not explicitly account for the compositional nature of cell-type proportions. Based on eleven scRNA-seq cohorts across different biological conditions, we evaluated several state-of-the-art sample representation methods for their ability to recover known biological groupings in an unsupervised setting. Surprisingly, we found that baseline approaches based on cell-type composition and pseudobulk gene expression consistently matched or outperformed more complex methods while requiring orders of magnitude fewer computational resources. In particular, centered log-ratio-transformed cell-type proportions achieved the highest stratification performance and demonstrated robustness to batch effects. The stratification signal was frequently concentrated in a small subset of highly variable cell types, and performance was robust across diverse cell-type annotation strategies. Altogether, these results suggest that clinically relevant inter-sample variation in scRNA-seq cohorts is largely driven by differences in cell-type composition. Importantly, compositional representations directly link cohort-level structure to specific cell populations, enabling mechanistic interpretation and facilitating clinical translation.
We provide scECODA, an open-source R package for scalable and interpretable cohort-level Exploratory COmpositional Data Analysis of scRNA-seq data, and establish cell-type compositional representations as a powerful and interpretable baseline for unsupervised patient stratification.
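To make the centered log-ratio (CLR) representation mentioned in the abstract concrete, here is a minimal Python sketch of the transform applied to a samples-by-cell-types count matrix. It is not the scECODA implementation (scECODA is an R package and its API is not shown here). The function name `clr_transform`, the pseudocount value, and the toy counts are all assumptions made for illustration.

```python
import numpy as np


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a (samples x cell types) count matrix.

    Hypothetical helper for illustration only. The pseudocount is an
    assumed way of handling zero counts, not a value taken from the source.
    """
    counts = np.asarray(counts, dtype=float)
    shifted = counts + pseudocount
    # Convert counts to per-sample proportions (each row sums to 1).
    props = shifted / shifted.sum(axis=1, keepdims=True)
    log_props = np.log(props)
    # Subtracting the row mean of the logs is equivalent to dividing
    # each proportion by the geometric mean of its sample.
    return log_props - log_props.mean(axis=1, keepdims=True)


# Toy cohort: 3 samples (rows) x 3 cell types (columns), invented numbers.
counts = np.array([
    [120, 30, 50],
    [60, 90, 50],
    [10, 150, 40],
])

clr = clr_transform(counts)

# Euclidean distance between CLR vectors is the Aitchison distance,
# which could feed a clustering or PCA step for sample stratification.
diff = clr[:, None, :] - clr[None, :, :]
aitchison = np.sqrt((diff ** 2).sum(axis=-1))
```

Each CLR row sums to zero by construction, so the values describe each cell type's abundance relative to the sample's geometric mean and not as an absolute proportion. That is what makes the representation appropriate for compositional data.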