Design of Worst-Case-Optimal Spaced Seeds

Rahmann, S.; Zentgraf, J.

2023-11-21 bioinformatics

10.1101/2023.11.20.567826 bioRxiv

Show abstract

Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many difference are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds or gapped k-mers. Given two integers k [≤] w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), we find the set of best masks that maximize these minima. As a result, we obtain highly robust masks, even for large numbers of changes. Their advantages are illustrated in two ways: First, we provide a new challenge dataset of simulated DNA reads, on which current methods like bwa-mem2, minimap2, or strobealign struggle to find seeds, and therefore cannot produce alignments against the human t2t reference genome, whereas we are able to find the correct location from a few unique spaced seeds. Second, we use real DNA data from the highly diverse human HLA region, which we are able to map correctly based on a few exactly matching spaced seeds of well-chosen masks, without evaluating alignments.

Design of Worst-Case-Optimal Spaced Seeds

Matching journals