Cell type composition drives patient stratification in single-cell RNA-seq cohorts

Halter, C.; Andreatta, M.; Carmona, S.

2026-03-31 bioinformatics
10.64898/2026.03.27.714811 bioRxiv

Early transcriptomic studies demonstrated that unsupervised analysis of bulk gene expression can reveal clinically meaningful patient subgroups. Single-cell RNA sequencing (scRNA-seq) provides high-resolution characterization of cellular heterogeneity and therefore enables more refined patient stratification. Several computational approaches have been proposed to summarize single-cell data into sample-level representations for cohort-level exploratory analyses. However, these methods generally do not explicitly account for the compositional nature of cell-type proportions. Based on eleven scRNA-seq cohorts across different biological conditions, we evaluated several state-of-the-art sample representation methods for their ability to recover known biological groupings in an unsupervised setting. Surprisingly, we found that baseline approaches based on cell-type composition and pseudobulk gene expression consistently matched or outperformed more complex methods while requiring orders of magnitude fewer computational resources. In particular, centered log-ratio-transformed cell-type proportions achieved the highest stratification performance and demonstrated robustness to batch effects. The stratification signal was frequently concentrated in a small subset of highly variable cell types, and performance was robust across diverse cell type annotation strategies. Altogether, these results suggest that clinically relevant inter-sample variation in scRNA-seq cohorts is largely driven by differences in cell-type composition. Importantly, compositional representations directly link cohort-level structure to specific cell populations, enabling mechanistic interpretation and facilitating clinical translation. 
We provide scECODA, an open-source R package for scalable and interpretable cohort-level Exploratory COmpositional Data Analysis of scRNA-seq data, and establish cell-type compositional representations as a powerful and interpretable baseline for unsupervised patient stratification.
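
As an illustration of the centered log-ratio (CLR) representation mentioned in the abstract, the following is a minimal Python sketch and not the scECODA implementation. The function name `clr_transform`, the pseudocount of 1 used to avoid taking the logarithm of zero, and the toy count matrix are all assumptions made for this example; the abstract does not state how zeros are handled.

```python
import numpy as np

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x cell-types count matrix.

    The pseudocount is an assumption of this sketch (one common way to
    handle zero counts); it is not taken from the source.
    """
    counts = np.asarray(counts, dtype=float)
    shifted = counts + pseudocount
    # Convert counts to per-sample proportions (each row sums to 1).
    proportions = shifted / shifted.sum(axis=1, keepdims=True)
    log_props = np.log(proportions)
    # Subtract each sample's mean log-proportion, which is equivalent to
    # dividing by the geometric mean of that sample's proportions.
    return log_props - log_props.mean(axis=1, keepdims=True)

# Hypothetical toy data: 3 samples (rows) x 3 cell types (columns).
counts = np.array([
    [120, 30, 50],
    [10, 150, 40],
    [60, 60, 80],
])
clr = clr_transform(counts)
```

Each row of `clr` sums to zero by construction, and the resulting matrix could be passed to a standard method such as PCA or hierarchical clustering for exploratory sample-level analysis.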

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Genome Biology | 555 | Top 0.3% | 10.4%
2 | Nature Biotechnology | 147 | Top 0.7% | 10.1%
3 | Nature Communications | 4913 | Top 21% | 9.1%
4 | Genome Medicine | 154 | Top 0.5% | 9.1%
5 | Cell Systems | 167 | Top 2% | 6.8%
6 | Genome Research | 409 | Top 0.4% | 6.3%
(50% of probability mass above)
7 | Nature Genetics | 240 | Top 2% | 4.3%
8 | Nucleic Acids Research | 1128 | Top 4% | 4.3%
9 | Nature Methods | 336 | Top 2% | 4.2%
10 | PLOS Computational Biology | 1633 | Top 9% | 3.7%
11 | Briefings in Bioinformatics | 326 | Top 3% | 2.4%
12 | Cell Genomics | 162 | Top 3% | 1.8%
13 | Scientific Reports | 3102 | Top 58% | 1.7%
14 | Cell Reports Methods | 141 | Top 3% | 1.5%
15 | Bioinformatics | 1061 | Top 8% | 1.5%
16 | Proceedings of the National Academy of Sciences | 2130 | Top 36% | 1.3%
17 | Cell Reports Medicine | 140 | Top 5% | 1.3%
18 | Science Advances | 1098 | Top 23% | 1.2%
19 | Advanced Science | 249 | Top 14% | 1.2%
20 | iScience | 1063 | Top 22% | 1.2%
21 | Communications Biology | 886 | Top 19% | 0.9%
22 | Nature Cell Biology | 99 | Top 4% | 0.8%
23 | Nature Biomedical Engineering | 42 | Top 2% | 0.7%
24 | NAR Genomics and Bioinformatics | 214 | Top 4% | 0.7%
25 | BMC Bioinformatics | 383 | Top 7% | 0.7%
26 | Cell Reports | 1338 | Top 34% | 0.7%
27 | The American Journal of Human Genetics | 206 | Top 4% | 0.7%
28 | eLife | 5422 | Top 61% | 0.6%
29 | Nature Machine Intelligence | 61 | Top 4% | 0.6%
30 | Science | 429 | Top 21% | 0.6%