scTGCL: A Transformer-Based Graph Contrastive Learning Approach for Efficiently Clustering Single-Cell RNA-seq Data

Khan, M. S. A.; Kabir, M. H.; Faisal, M. M.

bioRxiv (bioinformatics), 2026-03-31 · doi:10.64898/2026.03.28.714542

Single-cell RNA sequencing (scRNA-seq) enables characterization of cellular heterogeneity, but clustering remains challenging due to high dimensionality, dropout-induced sparsity, and technical noise. Existing graph-based and contrastive methods often rely on predefined similarity measures or suffer from high computational costs on large datasets. We propose single-cell Transformer-based Graph Contrastive Learning (scTGCL), a framework integrating multi-head self-attention with graph contrastive learning to learn robust cell representations. The model projects raw expression data into an embedding space and employs multi-head attention to adaptively learn weighted cell-cell graphs that capture diverse biological relationships. For contrastive augmentation, we apply random gene masking at the feature level and random edge dropping on the attention matrices, simulating dropout and structural uncertainty. A symmetric contrastive loss maximizes agreement between the original and augmented representations, while joint optimization with reconstruction and imputation losses preserves biological interpretability. Experiments on ten real scRNA-seq datasets demonstrate that scTGCL consistently outperforms nine state-of-the-art methods in clustering accuracy, normalized mutual information, and adjusted Rand index. Ablation studies validate each architectural component, and robustness analysis on simulated data confirms stable performance under varying dropout rates and differential expression levels. Furthermore, scTGCL exhibits superior computational efficiency, achieving substantially lower runtime on large-scale datasets than existing approaches. The framework provides an accurate, efficient, and scalable solution for single-cell clustering. Source code and datasets are available at https://github.com/ShoaibAbdullahKhan/scTGCL.
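The two augmentations and the symmetric contrastive loss described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, masking/dropping rates, and the NT-Xent-style form of the loss are assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_genes(X, mask_rate=0.2, rng=rng):
    """Feature-level augmentation: zero out random gene entries per cell,
    mimicking dropout-induced sparsity in scRNA-seq counts."""
    X_aug = X.copy()
    X_aug[rng.random(X.shape) < mask_rate] = 0.0
    return X_aug

def drop_edges(A, drop_rate=0.2, rng=rng):
    """Structural augmentation: randomly remove edges from a symmetric
    cell-cell attention/adjacency matrix, keeping self-loops."""
    keep = rng.random(A.shape) >= drop_rate
    keep = np.triu(keep, 1)       # decide drops on the upper triangle
    keep = keep | keep.T          # mirror so the matrix stays symmetric
    np.fill_diagonal(keep, True)  # never drop self-loops
    return A * keep

def symmetric_contrastive_loss(Z1, Z2, tau=0.5):
    """NT-Xent-style loss between two views: the same cell across views
    is the positive pair, all other cells are negatives, averaged over
    both view directions (hence 'symmetric')."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-12)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-12)
    sim = Z1 @ Z2.T / tau  # (n_cells, n_cells) cross-view similarities
    log_p12 = sim.diagonal() - np.log(np.exp(sim).sum(axis=1))  # view1 -> view2
    log_p21 = sim.diagonal() - np.log(np.exp(sim).sum(axis=0))  # view2 -> view1
    return -(log_p12.mean() + log_p21.mean()) / 2

# Toy expression matrix: 4 cells x 6 genes, two independently augmented views.
X = rng.random((4, 6))
loss = symmetric_contrastive_loss(mask_genes(X), mask_genes(X))
print(drop_edges(np.ones((4, 4))).shape, float(loss) > 0)
```

In the actual framework the two views would be passed through the shared transformer encoder before the loss is computed; here the masked matrices stand in for embeddings only to keep the sketch self-contained.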

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Top %     Probability
 1    Nature Communications                            4913                    Top 10%   14.6%
 2    Nature Methods                                    336                    Top 0.6%  14.2%
 3    Bioinformatics                                   1061                    Top 2%    12.6%
 4    Nature Biotechnology                              147                    Top 0.9%   8.3%
 5    Genome Research                                   409                    Top 0.3%   6.7%
----- 50% of probability mass above -----
 6    Nucleic Acids Research                           1128                    Top 3%     6.3%
 7    Genome Biology                                    555                    Top 1%     6.3%
 8    Briefings in Bioinformatics                       326                    Top 1%     4.8%
 9    Cell Systems                                      167                    Top 7%     1.7%
10    Bioinformatics Advances                           184                    Top 3%     1.7%
11    Genome Medicine                                   154                    Top 5%     1.7%
12    Cell Reports Methods                              141                    Top 3%     1.5%
13    Nature Genetics                                   240                    Top 5%     1.5%
14    Advanced Science                                  249                    Top 14%    1.2%
15    BMC Bioinformatics                                383                    Top 6%     1.2%
16    Nature Computational Science                       50                    Top 1.0%   1.2%
17    PLOS Computational Biology                       1633                    Top 20%    1.1%
18    Proceedings of the National Academy of Sciences  2130                    Top 40%    0.9%
19    Nature Machine Intelligence                        61                    Top 3%     0.9%
20    NAR Genomics and Bioinformatics                   214                    Top 3%     0.9%
21    iScience                                         1063                    Top 35%    0.7%
22    Communications Biology                            886                    Top 27%    0.7%