
On the benchmarking of clustering algorithms and hyperparameter influence for cell type detection in single-cell RNA sequencing data.

Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.

2026-05-17 bioinformatics
10.1101/2025.08.20.671270 bioRxiv

Clustering data from single-cell RNA-seq (scRNA-seq) and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold-standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of the numerical hyperparameters: the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms.
While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.
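
The abstract's point that a benchmark can over- or underestimate performance depending on how the hyperparameter space is explored can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the grid values, the `toy_score` function (a synthetic stand-in for a clustering quality score, not real data and not a result from the paper), the choice of "default" configuration, and the three scenario names. It is not the authors' framework.

```python
from statistics import mean

# Hypothetical grid over the two numerical hyperparameters named in the
# abstract: neighborhood size k and the resolution hyperparameter.
K_VALUES = [5, 15, 30, 50]
RESOLUTIONS = [0.25, 0.5, 1.0, 2.0]


def toy_score(k, resolution):
    """Synthetic placeholder for a quality score in [0, 1]; NOT real data."""
    return max(0.0, 1.0 - abs(k - 15) / 50 - abs(resolution - 0.5))


def scenario_summary(score_fn, default=(30, 1.0)):
    """Summarize one method under three hyperparameter selection scenarios."""
    # Score every (k, resolution) pair in the grid.
    grid = {(k, r): score_fn(k, r) for k in K_VALUES for r in RESOLUTIONS}
    best_config = max(grid, key=grid.get)
    return {
        # Best case: hyperparameters tuned with access to ground-truth labels.
        "oracle": grid[best_config],
        "oracle_config": best_config,
        # A single fixed configuration (hypothetical default).
        "default": grid[default],
        # Average over the whole grid, i.e. no tuning at all.
        "grid_mean": mean(grid.values()),
    }


summary = scenario_summary(toy_score)
for name, value in summary.items():
    print(name, value)
```

Reporting only the oracle value would describe the method under ideally tuned settings, while the fixed-configuration and grid-average values describe the less favorable, more realistic settings the abstract refers to.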

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics | 383 papers in training set | Top 0.3% | 18.7%
2. Bioinformatics | 1061 papers in training set | Top 3% | 10.1%
3. Briefings in Bioinformatics | 326 papers in training set | Top 0.5% | 8.4%
4. NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.1% | 8.4%
5. PLOS Computational Biology | 1633 papers in training set | Top 5% | 6.8%
50% of probability mass above
6. GigaScience | 172 papers in training set | Top 0.4% | 4.0%
7. Genome Biology | 555 papers in training set | Top 2% | 3.6%
8. Bioinformatics Advances | 184 papers in training set | Top 2% | 2.9%
9. PLOS ONE | 4510 papers in training set | Top 43% | 2.9%
10. Frontiers in Bioinformatics | 45 papers in training set | Top 0.1% | 2.4%
11. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 4% | 1.9%
12. Scientific Reports | 3102 papers in training set | Top 53% | 1.9%
13. Nucleic Acids Research | 1128 papers in training set | Top 9% | 1.9%
14. PeerJ | 261 papers in training set | Top 7% | 1.7%
15. Genome Research | 409 papers in training set | Top 2% | 1.7%
16. Journal of Computational Biology | 37 papers in training set | Top 0.2% | 1.5%
17. BMC Genomics | 328 papers in training set | Top 3% | 1.3%
18. Biology Methods and Protocols | 53 papers in training set | Top 2% | 0.9%
19. iScience | 1063 papers in training set | Top 26% | 0.9%
20. Nature Communications | 4913 papers in training set | Top 60% | 0.9%
21. Journal of Proteome Research | 215 papers in training set | Top 2% | 0.8%
22. Analytical Chemistry | 205 papers in training set | Top 3% | 0.7%
23. Molecular & Cellular Proteomics | 158 papers in training set | Top 2% | 0.6%
24. Frontiers in Genetics | 197 papers in training set | Top 11% | 0.6%
25. Computers in Biology and Medicine | 120 papers in training set | Top 5% | 0.6%
26. International Journal of Molecular Sciences | 453 papers in training set | Top 19% | 0.5%
27. Life Science Alliance | 263 papers in training set | Top 3% | 0.5%
28. Cell Reports Methods | 141 papers in training set | Top 7% | 0.5%
29. Methods | 29 papers in training set | Top 0.9% | 0.5%
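
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with simple arithmetic on the listed values. The sketch below copies the five highest percentages from the list; reading the 50% figure as a threshold (the smallest number of journals whose cumulative probability reaches 50%) rather than an exact sum is an assumption, and the helper name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (%) of the five top-ranked journals, copied from the list.
top_probs = [18.7, 10.1, 8.4, 8.4, 6.8]

# Running total of probability mass after each journal.
cumulative = list(accumulate(top_probs))


def journals_to_reach(threshold, running_totals):
    """Return the smallest number of journals whose cumulative mass >= threshold."""
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count
    return None  # threshold not reached within the given journals


n_needed = journals_to_reach(50.0, cumulative)
print(n_needed, round(cumulative[-1], 1))
```

Under that reading, the first four journals sum to about 45.6% and the first five to about 52.4%, so five is the first point at which the cumulative mass crosses 50%.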