kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations
Khan, J.; Patro, R.; Pandey, P.
Show abstract
MotivationHash tables are fundamental to computational genomics, where keys are often k-mers--fixed-length substrings that exhibit a "streaming" property: consecutive k-mers share k-1 nucleotides and are processed in order. Existing static data structures exploit this locality but cannot support dynamic updates, while state-of-the-art concurrent hash tables support dynamic operations but ignore k-mer locality. ResultsWe introduce kache-hash, the first dynamic, concurrent, and resizable hash table that exploits k-mer locality. kache-hash builds on Iceberg hashing--a multi-level design achieving stability and low associativity--but replaces generic hashing with minimizer-based hashing, ensuring that consecutive k-mers map to the same buckets. This keeps frequently accessed buckets cache-resident during streaming operations. On the human genome, kache-hash achieves 1.58-2.62x higher insertion throughput than IcebergHT and up to 6.1x higher query throughput, while incurring 7.39x fewer cache misses. kache-hash scales near-linearly to 16 threads and supports dynamic resizing without sacrificing locality. Our theoretical analysis proves that streaming k-mer operations achieve [O](1/r) amortized cache misses per operation, where r is the minimizer run length, explaining the substantial performance gains over general-purpose hash tables. Availabilitykache-hash is implemented in C++20 and is available at https://github.com/jamshed/kache-hash. Contactp.pandey@northeastern.edu Supplementary informationSupplementary material are available for this manuscript.
Matching journals
The top 1 journal accounts for 50% of the predicted probability mass.