Interpretable Biological Sequence Clustering with iClust

Zhang, S.; Liu, X.; Lou, J.; Jiang, M.; He, Z.

2026-04-16 bioinformatics
10.64898/2026.04.13.718335 bioRxiv
Biological sequence clustering is a fundamental problem in bioinformatics, yet most existing methods mainly optimize clustering quality or efficiency while offering limited insight into why sequences are grouped together. This restricts their usefulness in downstream analysis, where representative sequences and clear cluster boundaries are often needed. To address this issue, we propose iClust, an interpretable clustering method that characterizes each cluster by a representative prototype and an adaptive radius. By adapting to local sequence structure rather than relying on a single global threshold, iClust produces clusters that are both meaningful and explainable. A final consolidation step further reduces tiny fragments and improves structural stability. Experiments on simulated and real biological sequence datasets show that iClust achieves competitive clustering performance while providing clearer cluster-level explanations than conventional threshold-based methods. In addition to its empirical impact as a practical clustering method for biological sequences, this article opens up new avenues for developing biological sequence clustering approaches from the viewpoint of interpretable machine learning.
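The prototype-plus-radius idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how iClust selects prototypes or adapts radii, so the sketch takes both as given inputs, and the distance measure (length-normalized edit distance) and the function names `edit_distance`, `normalized_distance`, `assign`, and `explain` are assumptions made for illustration only. It shows why such a cluster description is easy to explain: membership reduces to "distance to the prototype is within that cluster's radius".

```python
from typing import List, Optional, Tuple

# A cluster is described by (prototype sequence, radius).
# Assumption: radii are per-cluster inputs; how they would be
# adapted to local structure is not shown here.
Cluster = Tuple[str, float]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed measure)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def assign(seq: str, clusters: List[Cluster]) -> Optional[int]:
    """Index of the nearest cluster whose radius covers seq, else None."""
    best_idx, best_d = None, None
    for idx, (prototype, radius) in enumerate(clusters):
        d = normalized_distance(seq, prototype)
        if d <= radius and (best_d is None or d < best_d):
            best_idx, best_d = idx, d
    return best_idx


def explain(seq: str, clusters: List[Cluster]) -> str:
    """Human-readable reason for the assignment of seq."""
    idx = assign(seq, clusters)
    if idx is None:
        return f"{seq}: outside every cluster radius"
    prototype, radius = clusters[idx]
    d = normalized_distance(seq, prototype)
    return (f"{seq}: cluster {idx}, distance {d:.3f} to prototype "
            f"{prototype} is within radius {radius:.3f}")
```

With a single global threshold, every cluster would share the same radius; letting each cluster carry its own radius is the part that corresponds to the abstract's "adaptive radius", under the assumptions stated above.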

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 0.9%; 26.1%
2. BMC Bioinformatics; 383 papers in training set; Top 0.4%; 17.7%
3. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 0.8%; 8.3%
(50% of probability mass above)
4. Briefings in Bioinformatics; 326 papers in training set; Top 1%; 4.9%
5. PLOS Computational Biology; 1633 papers in training set; Top 8%; 4.3%
6. PLOS ONE; 4510 papers in training set; Top 39%; 3.6%
7. NAR Genomics and Bioinformatics; 214 papers in training set; Top 0.9%; 3.1%
8. Bioinformatics Advances; 184 papers in training set; Top 2%; 2.8%
9. Nucleic Acids Research; 1128 papers in training set; Top 9%; 1.9%
10. Nature Communications; 4913 papers in training set; Top 51%; 1.7%
11. Journal of Molecular Biology; 217 papers in training set; Top 2%; 1.5%
12. Journal of Chemical Information and Modeling; 207 papers in training set; Top 2%; 1.5%
13. Scientific Reports; 3102 papers in training set; Top 62%; 1.5%
14. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 6%; 1.3%
15. IEEE Transactions on Computational Biology and Bioinformatics; 17 papers in training set; Top 0.4%; 1.2%
16. Communications Biology; 886 papers in training set; Top 17%; 1.0%
17. Advanced Science; 249 papers in training set; Top 17%; 0.8%
18. IEEE/ACM Transactions on Computational Biology and Bioinformatics; 32 papers in training set; Top 0.6%; 0.8%
19. Genome Biology; 555 papers in training set; Top 7%; 0.8%
20. BMC Genomics; 328 papers in training set; Top 6%; 0.8%
21. GigaScience; 172 papers in training set; Top 3%; 0.8%
22. Frontiers in Molecular Biosciences; 100 papers in training set; Top 5%; 0.7%
23. Genome Research; 409 papers in training set; Top 4%; 0.7%
24. IEEE Journal of Biomedical and Health Informatics; 34 papers in training set; Top 2%; 0.7%
25. Journal of Computational Biology; 37 papers in training set; Top 0.6%; 0.7%
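
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked from the listed percentages. The sketch below is only an arithmetic illustration; the helper name `entries_to_reach` is hypothetical, and the input values are the five highest probabilities from the ranking above.

```python
from typing import List, Optional, Tuple


def entries_to_reach(probabilities: List[float],
                     target: float) -> Optional[Tuple[int, float]]:
    """Return (count, cumulative sum) for the first prefix reaching target."""
    total = 0.0
    for count, p in enumerate(probabilities, 1):
        total += p
        if total >= target:
            return count, total
    return None  # target never reached


# Predicted probabilities (percent) of the five top-ranked journals.
top_probabilities = [26.1, 17.7, 8.3, 4.9, 4.3]

result = entries_to_reach(top_probabilities, 50.0)
```

The first two entries sum to 43.8%, so the third journal (8.3%) is the one that carries the cumulative total past 50%.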