Min-frame transformation enables more sensitive viral genome alignment
Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.
Show abstract
MotivationMaximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. MethodsTo address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. ImpactThe MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these improvements have the potential to result in improvements for downstream viral genome analysis applications such as phylogenetic inference and transmission analysis. FundingTandy Warnow: NSF grant 2316233 Todd J. Treangen: NSF grants 2126387, 2239114, NIH grants U19-AI144297, P01-AI152999
Matching journals
The top 1 journal accounts for 50% of the predicted probability mass.