Back

Fast Set Operations for Compact k-mer Sets

Alanko, J.; Depuydt, L.; MARCHET, C.; Puglisi, S. J.

2026-05-27 bioinformatics
10.64898/2026.05.24.727514 bioRxiv
Show abstract

The k-mer spectrum of a set of sequences is the set of k-length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k-spectra that also supports efficient k-mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k-spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k-spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k-mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k-mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k-mer set representations offered by the SBWT. 2012 ACM Subject ClassificationTheory of computation [->] Design and analysis of algorithms Digital Object Identifier10.4230/LIPIcs.WABI.2026. Supplementary MaterialSoftware (Source Code): https://github.com/LoreDepuydt/sbwt-set-operations FundingThis work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.

1
Bioinformatics
1061 papers in training set
Top 0.2%
52.9%
50% of probability mass above
2
Genome Research
409 papers in training set
Top 0.2%
8.3%
3
Cell Systems
167 papers in training set
Top 2%
4.9%
4
iScience
1063 papers in training set
Top 4%
3.6%
5
BMC Bioinformatics
383 papers in training set
Top 3%
3.6%
6
Bioinformatics Advances
184 papers in training set
Top 2%
3.3%
7
Nature Communications
4913 papers in training set
Top 49%
1.9%
8
Algorithms for Molecular Biology
15 papers in training set
Top 0.1%
1.7%
9
PLOS Computational Biology
1633 papers in training set
Top 17%
1.5%
10
Nature Biotechnology
147 papers in training set
Top 5%
1.4%
11
Genome Biology
555 papers in training set
Top 6%
1.2%
12
Nature Methods
336 papers in training set
Top 6%
0.9%
13
NAR Genomics and Bioinformatics
214 papers in training set
Top 3%
0.8%
14
Journal of Computational Biology
37 papers in training set
Top 0.5%
0.8%
15
Frontiers in Molecular Biosciences
100 papers in training set
Top 5%
0.7%
16
Nature Computational Science
50 papers in training set
Top 2%
0.7%
17
Nucleic Acids Research
1128 papers in training set
Top 20%
0.7%
18
Molecular Biology and Evolution
488 papers in training set
Top 5%
0.7%
19
Briefings in Bioinformatics
326 papers in training set
Top 7%
0.7%
20
Frontiers in Genetics
197 papers in training set
Top 11%
0.7%
21
IEEE Transactions on Computational Biology and Bioinformatics
17 papers in training set
Top 0.8%
0.7%
22
Scientific Reports
3102 papers in training set
Top 79%
0.5%