SCANBIT facilitates identification of tumor cell populations in scRNAseq data using pseudobulked SNV calls

Cannon, M. V.; Gust, M. J.; Gross, A. C.; Cam, M.; Reinecke, J. B.; Jimenez Garcia, L.; Strawser, C. H.; Ryan, L.; Sammons, M.; Zhang, C.-Z.; Roberts, R. D.

2026-01-28 bioinformatics
10.64898/2026.01.27.701834 bioRxiv
Motivation: Single-cell RNAseq (scRNAseq) is an ideal tool to characterize the heterogeneity within the tumor microenvironment; however, accurate identification of tumor cells can be a challenge. Reference-based methods can be inaccurate, if reference datasets are even available. Current purpose-built methods can also be inaccurate, particularly with highly heterogeneous tumor types. Improved methods are needed. We explored the use of genetic variants to distinguish tumor from normal cells within scRNAseq data.

Results: We characterized the limitations inherent to calling variants from scRNAseq data, quantifying how data sparsity precludes genetic distance calculation between single cells. As a novel workaround, we pooled data from transcriptionally similar cell clusters to call high-quality variants, then calculated pairwise differences between cell populations and performed hierarchical clustering. We quantified confidence in genetic divergence between tumor and normal cell populations using bootstrapping. We performed extensive validation to assess accurate identification of tumor cells using ground-truth datasets. Application of our method to human scRNAseq samples highlighted the utility of our approach and revealed how mutational burden influences successful tumor cell identification. Improved cell type assignment in scRNAseq data will facilitate analysis of tumor samples and, in turn, accelerate our understanding of the mechanisms underlying tumor progression and reveal potential biological vulnerabilities that can be exploited to develop improved treatment options.

Availability and implementation: Our method is publicly available as an R package: SCANBIT (Single Cell Altered Nucleotide Based Inference of Tumor), https://github.com/kidcancerlab/scanBit.
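The workflow described in the abstract (pooled variant calls per cluster, pairwise differences, hierarchical clustering, bootstrap confidence) can be illustrated with a small sketch. This is not the SCANBIT implementation, which is an R package. It is a minimal standard-library Python illustration under stated assumptions: genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for uncalled sites, the difference measure is the fraction of shared sites with differing genotypes, clustering uses average linkage, and all function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def pairwise_difference(a, b):
    """Fraction of sites called in both groups where the genotypes differ."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no variant site is called in both groups")
    return sum(x != y for x, y in shared) / len(shared)


def top_level_split(calls):
    """Average-linkage agglomeration, stopped when two groups remain."""
    names = sorted(calls)
    dist = {frozenset(pair): pairwise_difference(calls[pair[0]], calls[pair[1]])
            for pair in combinations(names, 2)}
    groups = [frozenset([name]) for name in names]

    def linkage(g, h):
        # Mean distance over all cross-group pairs of clusters.
        return sum(dist[frozenset((p, q))] for p in g for q in h) / (len(g) * len(h))

    while len(groups) > 2:
        g, h = min(combinations(groups, 2), key=lambda gh: linkage(*gh))
        groups = [x for x in groups if x not in (g, h)] + [g | h]
    return frozenset(groups)


def bootstrap_support(calls, n_boot=200, seed=0):
    """Resample variant sites with replacement; report how often the
    observed two-group split is recovered."""
    rng = random.Random(seed)
    n_sites = len(next(iter(calls.values())))
    observed = top_level_split(calls)
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: [g[i] for i in idx] for name, g in calls.items()}
        hits += top_level_split(resampled) == observed
    return observed, hits / n_boot


# Toy pseudobulked genotype calls (hypothetical): one list per cell cluster,
# one entry per variant site.
calls = {
    "normal_a": [0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0],
    "normal_b": [0, 0, 1, 0, 2, 0, 0, 1, 0, None, 2, 0],
    "tumor_a":  [1, 1, 1, 0, 2, 1, 0, 1, 1, 0, 2, 1],
    "tumor_b":  [1, 1, 1, 0, 2, 1, 0, 1, 1, 0, 2, 0],
}

observed, support = bootstrap_support(calls)
print(sorted(sorted(group) for group in observed), support)
```

In this toy setting the two top-level groups separate the clusters sharing extra variants from those without them, and the bootstrap fraction gives a rough sense of how strongly the sites support that split. With few distinguishing variants the support would be expected to drop, which mirrors the abstract's point that mutational burden influences successful tumor cell identification.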

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 1%, 18.7%
2. BMC Bioinformatics: 383 papers in training set, Top 0.3%, 18.6%
3. PLOS Computational Biology: 1633 papers in training set, Top 8%, 4.2%
4. Genome Medicine: 154 papers in training set, Top 2%, 4.0%
5. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.5%, 4.0%
6. Nucleic Acids Research: 1128 papers in training set, Top 6%, 3.6%

50% of probability mass above

7. Scientific Reports: 3102 papers in training set, Top 40%, 3.2%
8. GigaScience: 172 papers in training set, Top 0.6%, 3.1%
9. Nature Communications: 4913 papers in training set, Top 43%, 2.9%
10. Cancer Research Communications: 46 papers in training set, Top 0.2%, 2.6%
11. Frontiers in Bioinformatics: 45 papers in training set, Top 0.1%, 2.4%
12. Nature Biotechnology: 147 papers in training set, Top 3%, 2.4%
13. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 2.1%
14. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.4%, 1.9%
15. Genome Biology: 555 papers in training set, Top 4%, 1.7%
16. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.7%
17. Biology Methods and Protocols: 53 papers in training set, Top 1%, 1.5%
18. BMC Genomics: 328 papers in training set, Top 4%, 1.2%
19. Cell Reports Methods: 141 papers in training set, Top 3%, 1.2%
20. NAR Cancer: 36 papers in training set, Top 0.1%, 1.1%
21. Frontiers in Genetics: 197 papers in training set, Top 7%, 1.1%
22. PLOS ONE: 4510 papers in training set, Top 63%, 0.9%
23. npj Precision Oncology: 48 papers in training set, Top 1.0%, 0.9%
24. Clinical Chemistry: 22 papers in training set, Top 0.8%, 0.8%
25. BMC Medical Genomics: 36 papers in training set, Top 1%, 0.8%
26. Communications Biology: 886 papers in training set, Top 21%, 0.8%
27. Cancer Research: 116 papers in training set, Top 3%, 0.7%
28. BioData Mining: 15 papers in training set, Top 0.9%, 0.7%
29. The Journal of Molecular Diagnostics: 36 papers in training set, Top 0.5%, 0.7%
30. Nature Methods: 336 papers in training set, Top 7%, 0.6%