Fast Set Operations for Compact k-mer Sets
Alanko, J.; Depuydt, L.; MARCHET, C.; Puglisi, S. J.
Show abstract
The k-mer spectrum of a set of sequences is the set of k-length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k-spectra that also supports efficient k-mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k-spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k-spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k-mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k-mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k-mer set representations offered by the SBWT. 2012 ACM Subject ClassificationTheory of computation [->] Design and analysis of algorithms Digital Object Identifier10.4230/LIPIcs.WABI.2026. Supplementary MaterialSoftware (Source Code): https://github.com/LoreDepuydt/sbwt-set-operations FundingThis work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.
Matching journals
The top 1 journal accounts for 50% of the predicted probability mass.