DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes

Hoang, M.; Zheng, H.; Kingsford, C.

2022-02-19 bioinformatics
10.1101/2022.02.17.480870 bioRxiv

Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite their widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question. Most work in this direction focuses on designing schemes that work well in expectation over random sequences, which has limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, require greedy approximations to solve an intractable discrete optimization problem on the permutation space of k-mer orderings. To address this challenge, we propose: (a) a reformulation of the combinatorial solution space using a deep neural network re-parameterization; and (b) a fully differentiable approximation of the discrete objective. We demonstrate that our framework, DeepMinimizer, discovers minimizer schemes that significantly outperform state-of-the-art constructions on genomic sequences.
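To make the abstract's terms concrete, here is a minimal sketch of a plain minimizer scheme and its sketch density. This is not the DeepMinimizer method. It assumes the standard textbook definition: each window of w consecutive k-mers contributes its smallest k-mer under some ordering, with ties broken by leftmost position. The function names, the `order` parameter, and the default lexicographic ordering are illustrative choices, not taken from the paper.

```python
def minimizer_positions(seq, k, w, order=None):
    """Start positions of the k-mers selected by a (w, k) minimizer scheme.

    `order` maps a k-mer string to a sortable key; the default (an
    assumption of this sketch) is plain lexicographic order.
    """
    if order is None:
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        raise ValueError("sequence too short for one full window")
    selected = set()
    for start in range(n_kmers - w + 1):
        # Pick the smallest k-mer in this window; ties go to the leftmost.
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, order=None):
    """Fraction of k-mer positions kept in the sketch (lower is smaller)."""
    return len(minimizer_positions(seq, k, w, order)) / (len(seq) - k + 1)


if __name__ == "__main__":
    print(minimizer_positions("ACGTACGT", k=3, w=2))
    print(density("ACGTACGT", k=3, w=2))
```

Changing `order` changes which positions are selected and therefore the density. A sequence-specific scheme, in the sense used in the abstract, is one whose ordering is chosen to make this density small on a particular target sequence.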

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Genome Research: 409 papers in training set, Top 0.1%, 22.4%
2. Cell Systems: 167 papers in training set, Top 0.4%, 17.4%
3. Nature Biotechnology: 147 papers in training set, Top 0.7%, 10.0%
4. Nature Communications: 4913 papers in training set, Top 22%, 8.4%
(50% of probability mass above)
5. Bioinformatics: 1061 papers in training set, Top 4%, 6.8%
6. Nature Methods: 336 papers in training set, Top 3%, 2.7%
7. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 2.6%
8. Bioinformatics Advances: 184 papers in training set, Top 2%, 2.1%
9. Genome Biology: 555 papers in training set, Top 3%, 2.1%
10. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 28%, 2.1%
11. PLOS Computational Biology: 1633 papers in training set, Top 15%, 1.9%
12. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set, Top 0.2%, 1.9%
13. Nature Genetics: 240 papers in training set, Top 4%, 1.7%
14. Algorithms for Molecular Biology: 15 papers in training set, Top 0.1%, 1.7%
15. Nature Computational Science: 50 papers in training set, Top 0.7%, 1.5%
16. iScience: 1063 papers in training set, Top 18%, 1.5%
17. Journal of Computational Biology: 37 papers in training set, Top 0.3%, 1.2%
18. Nucleic Acids Research: 1128 papers in training set, Top 15%, 0.9%
19. The American Journal of Human Genetics: 206 papers in training set, Top 3%, 0.8%
20. Molecular Biology and Evolution: 488 papers in training set, Top 4%, 0.7%
21. BMC Bioinformatics: 383 papers in training set, Top 7%, 0.7%
22. Scientific Reports: 3102 papers in training set, Top 78%, 0.6%