Back

A near-tight lower bound on the density of forward sampling schemes

Kille, B.; Groot Koerkamp, R.; McAdams, D.; Liu, A.; Treangen, T.

2024-09-11 bioinformatics
10.1101/2024.09.06.611668 bioRxiv
Show abstract

MotivationSampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e., have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two. ResultsWe prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k {equiv} 1 (mod w). For large w and k, the bound can be approximated by [Formula]. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k {equiv} 1 (mod w) and the alphabet size{sigma} goes to {infty}, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound. Availability and implementationMinimizer implementations: github.com/RagnarGrootKoerkamp/minimizers ILP and analysis: github.com/treangenlab/sampling-scheme-analysis

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1
Bioinformatics
1061 papers in training set
Top 0.3%
42.1%
2
Genome Research
409 papers in training set
Top 0.1%
10.7%
50% of probability mass above
3
Nature Communications
4913 papers in training set
Top 25%
7.2%
4
Algorithms for Molecular Biology
15 papers in training set
Top 0.1%
6.8%
5
Cell Systems
167 papers in training set
Top 2%
5.1%
6
PLOS Computational Biology
1633 papers in training set
Top 9%
3.8%
7
Nature Computational Science
50 papers in training set
Top 0.3%
2.2%
8
BMC Bioinformatics
383 papers in training set
Top 4%
2.2%
9
Nature Methods
336 papers in training set
Top 4%
1.8%
10
iScience
1063 papers in training set
Top 13%
1.8%
11
Nature Biotechnology
147 papers in training set
Top 5%
1.4%
12
Genome Biology
555 papers in training set
Top 5%
1.3%
13
Scientific Reports
3102 papers in training set
Top 68%
1.2%
14
PLOS ONE
4510 papers in training set
Top 62%
1.0%
15
Proceedings of the National Academy of Sciences
2130 papers in training set
Top 42%
0.8%
16
Genetics
225 papers in training set
Top 4%
0.8%
17
Journal of Computational Biology
37 papers in training set
Top 0.7%
0.7%
18
IEEE Transactions on Computational Biology and Bioinformatics
17 papers in training set
Top 0.8%
0.7%
19
Bioinformatics Advances
184 papers in training set
Top 5%
0.7%
20
Briefings in Bioinformatics
326 papers in training set
Top 8%
0.5%
21
Genome Biology and Evolution
280 papers in training set
Top 2%
0.5%