Back

Design of Worst-Case-Optimal Spaced Seeds

Rahmann, S.; Zentgraf, J.

2023-11-21 bioinformatics
10.1101/2023.11.20.567826 bioRxiv
Show abstract

Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many difference are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds or gapped k-mers. Given two integers k [≤] w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), we find the set of best masks that maximize these minima. As a result, we obtain highly robust masks, even for large numbers of changes. Their advantages are illustrated in two ways: First, we provide a new challenge dataset of simulated DNA reads, on which current methods like bwa-mem2, minimap2, or strobealign struggle to find seeds, and therefore cannot produce alignments against the human t2t reference genome, whereas we are able to find the correct location from a few unique spaced seeds. Second, we use real DNA data from the highly diverse human HLA region, which we are able to map correctly based on a few exactly matching spaced seeds of well-chosen masks, without evaluating alignments.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1
Bioinformatics
1061 papers in training set
Top 0.7%
28.5%
2
Cell Systems
167 papers in training set
Top 0.5%
15.2%
3
Genome Research
409 papers in training set
Top 0.1%
10.4%
50% of probability mass above
4
Nature Biotechnology
147 papers in training set
Top 2%
4.1%
5
Nature Communications
4913 papers in training set
Top 37%
4.0%
6
Algorithms for Molecular Biology
15 papers in training set
Top 0.1%
3.7%
7
BMC Bioinformatics
383 papers in training set
Top 3%
3.2%
8
Genome Biology
555 papers in training set
Top 3%
2.8%
9
Nature Methods
336 papers in training set
Top 3%
2.4%
10
Bioinformatics Advances
184 papers in training set
Top 2%
2.2%
11
iScience
1063 papers in training set
Top 11%
1.9%
12
Nature Computational Science
50 papers in training set
Top 0.4%
1.9%
13
PLOS Computational Biology
1633 papers in training set
Top 15%
1.8%
14
Briefings in Bioinformatics
326 papers in training set
Top 4%
1.7%
15
Scientific Reports
3102 papers in training set
Top 63%
1.4%
16
IEEE Transactions on Computational Biology and Bioinformatics
17 papers in training set
Top 0.3%
1.4%
17
PLOS ONE
4510 papers in training set
Top 60%
1.3%
18
Genetics
225 papers in training set
Top 3%
0.9%
19
NAR Genomics and Bioinformatics
214 papers in training set
Top 3%
0.8%
20
Communications Biology
886 papers in training set
Top 22%
0.8%
21
Proceedings of the National Academy of Sciences
2130 papers in training set
Top 43%
0.8%
22
Nucleic Acids Research
1128 papers in training set
Top 17%
0.8%
23
Molecular Biology and Evolution
488 papers in training set
Top 5%
0.7%
24
Journal of Computational Biology
37 papers in training set
Top 0.8%
0.5%