Fast and alignment-free flavivirus classification from low-coverage genomes
Shahid, A.; Ulrich, J.-U.; Kuehnert, D.
High genomic variability among viral species makes sequence classification highly dependent on multiple sequence alignment (MSA) methods, which are both computationally intensive and sensitive to data quality issues. To provide a more efficient and robust alternative, we developed DiCNN-UniK, a Dual-Input Convolutional Neural Network (DiCNN) that uses unique k-mer signatures and universal k-mer libraries to generate novel, direct embeddings. Instead of relying on k-mer frequency patterns, DiCNN-UniK works directly on k-mer embedding information, which preserves local genomic context. The architecture is designed to handle full-length genomic sequences, overcoming the restrictive 512-token limit common in many genomic foundation models. Trained on flaviviruses, our model shows high sensitivity, robustness, and reliability, achieving an accuracy of 99% on an independent test set. Although DiCNN-UniK is trained on full-genome data, it handles partial genomic sequences without preprocessing, maintaining high accuracy and precision with genomic coverage as low as 20%. DiCNN-UniK currently stands as the best available model for the classification of flaviviruses, offering a sensitive, robust, and reliable solution for sequence analysis under real-world genomic coverage and data quality scenarios.
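To make the k-mer ideas in the abstract concrete, the following is a minimal Python sketch of how a sequence might be turned into k-mer token ids against a universal k-mer library, and how a low-coverage input might be simulated. It is an illustration under stated assumptions, not the DiCNN-UniK implementation: the function names (`build_kmer_vocab`, `encode`, `unique_kmers`, `subsample`), the choice of reserving id 0 for k-mers with ambiguous bases, and the use of a single contiguous fragment to model partial coverage are all hypothetical, and the abstract does not specify the value of k or the embedding layer.

```python
from itertools import product
import random


def build_kmer_vocab(k):
    """Hypothetical universal library: every k-mer over ACGT gets an id >= 1.

    Id 0 is reserved (an assumption) for k-mers containing other characters.
    """
    return {"".join(p): i + 1 for i, p in enumerate(product("ACGT", repeat=k))}


def encode(seq, vocab, k):
    """Slide a window of width k over seq and return one id per position."""
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


def unique_kmers(seq, k):
    """Return the set of distinct k-mers observed in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subsample(seq, coverage, rng):
    """Return one random contiguous fragment spanning `coverage` of seq.

    This is only one simple way to mimic a partial genome.
    """
    length = max(1, round(len(seq) * coverage))
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


if __name__ == "__main__":
    k = 3
    vocab = build_kmer_vocab(k)
    genome = "ACGTTGCAAGCTNACGTAGGCTAACGT"  # toy sequence, not real data
    fragment = subsample(genome, 0.2, random.Random(0))
    print(len(vocab), encode(fragment, vocab, k), sorted(unique_kmers(fragment, k)))
```

In a sketch like this, the id sequence keeps the order of k-mers along the genome, which is what distinguishes an embedding-based input from a bag of k-mer counts; the ids would then be fed to an embedding layer, which is outside the scope of this example.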
Matching journals
The top 8 journals account for 50% of the predicted probability mass.