Back

Generating minimum-density minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-01-28 bioinformatics
10.64898/2026.01.25.701585 bioRxiv
Show abstract

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size{sigma} , a minimizer is defined by two positive integers k, w and a linear order{rho} on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to{rho} . A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite{sigma} -ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. While the hardness of finding a minimizer of minimum density for given input parameters ({sigma}, k, w) is unknown, it has a huge search space of ({sigma}k)! and there is no known algorithm apart from a trivial brute-force search. In this paper, we tackle the minimum density problem for minimizers. We first formulate this problem as an ILP of size{Theta} (w{sigma}w+k), which has worst-case solution time that is doubly-exponential in (k + w) under standard complexity assumptions. Our experiments show that an ILP solver terminates with an optimal solution only for very small k and w. We then present our main method, called OptMini, which computes an optimal minimizer in [Formula] time and thus is capable of processing large w values. In experiments, OptMini works much faster than the runtime predicts due to several additional tricks shrinking the search space without harming optimality. We use OptMini to compute minimum-density minimizers for ({sigma}, k) [isin] {(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (4, 2)} and w [isin] [2, 3{sigma}k], with the exception of certain w-ranges for k = 6 and the single case of k = 5, w = 2. Finally, we derive conclusions and insights regarding the density values as a function of w, patterns in optimal minimizer orders, and the relation between minimum-size universal hitting sets and minimum-density minimizers.

Matching journals

The top 2 journals account for 50% of the predicted probability mass.