Fast and alignment-free flavivirus classification from low-coverage genomes

Shahid, A.; Ulrich, J.-U.; Kuehnert, D.

2026-02-20 bioinformatics
10.64898/2026.02.20.706982 bioRxiv
High genomic variability among viral species makes sequence classification highly dependent on multiple sequence alignment (MSA) methods, which are both computationally intensive and sensitive to data quality issues. To provide a more efficient and robust alternative, we developed DiCNN-UniK, a Dual-Input Convolutional Neural Network (DiCNN) utilizing unique k-mer signatures and universal k-mer libraries to generate novel and direct embeddings. Instead of relying on k-mer frequency patterns, DiCNN-UniK directly leverages k-mer embedding information, which provides a clear picture of local genomic context. This architecture is designed to handle full-length genomic sequences, overcoming the restrictive 512-token limit common in many genomic foundation models. Trained on Flaviviruses, our model shows high sensitivity, robustness, and reliability, achieving an accuracy of 99% on an independent test set. DiCNN-UniK is trained on full-genome data and is able to handle partial genomic sequences without preprocessing, maintaining high accuracy and precision with genomic coverage as low as 20%. DiCNN-UniK currently stands as the best available model for the classification of flaviviruses, offering a sensitive, robust, and reliable solution for sequence analysis under real-world genomic coverage and data quality scenarios.
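As a rough illustration of the k-mer-based input described in the abstract, the sketch below maps a sequence onto integer IDs drawn from a library of all possible k-mers and truncates a genome to a given coverage fraction. It is not the authors' implementation: the abstract does not state the value of k, how the universal library or the unique k-mer signatures are built, or how the embeddings are learned. The function names, the choice k = 3, and the convention of reserving ID 0 for k-mers containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def build_universal_kmer_index(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer ID.

    ID 0 is reserved (assumption) for k-mers that are not in the library,
    for example those containing ambiguous bases such as N.
    """
    return {"".join(p): i + 1 for i, p in enumerate(product(alphabet, repeat=k))}


def encode_sequence(seq, kmer_index, k):
    """Slide a window of width k over seq and look up each k-mer's ID."""
    seq = seq.upper()
    return [kmer_index.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


def subsample_fragment(seq, coverage, start=0):
    """Return a contiguous fragment covering the given fraction of seq.

    A hypothetical way to mimic a partial genome, e.g. coverage=0.2 for 20%.
    """
    length = max(1, int(len(seq) * coverage))
    return seq[start:start + length]


if __name__ == "__main__":
    k = 3  # illustrative only; the abstract does not give k
    index = build_universal_kmer_index(k)
    genome = "ACGTACGTNACGTTGCAACG"
    partial = subsample_fragment(genome, coverage=0.5)
    print(partial, encode_sequence(partial, index, k))
```

The resulting ID list is the kind of token sequence that an embedding layer in a convolutional network could consume; because the window slides over the whole input, its length grows with the sequence rather than being capped at a fixed token limit.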

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics | 326 papers in training set | Top 0.3% | 10.4%
2. Nature Communications | 4913 papers in training set | Top 18% | 10.1%
3. Bioinformatics | 1061 papers in training set | Top 3% | 7.2%
4. Genome Biology | 555 papers in training set | Top 1% | 6.4%
5. Nature Biotechnology | 147 papers in training set | Top 2% | 4.8%
6. Nucleic Acids Research | 1128 papers in training set | Top 4% | 4.3%
7. Advanced Science | 249 papers in training set | Top 6% | 3.6%
8. PLOS Computational Biology | 1633 papers in training set | Top 10% | 3.6%
50% of probability mass above
9. BMC Bioinformatics | 383 papers in training set | Top 3% | 3.1%
10. Bioinformatics Advances | 184 papers in training set | Top 2% | 2.7%
11. Genome Medicine | 154 papers in training set | Top 3% | 2.6%
12. Nature Machine Intelligence | 61 papers in training set | Top 1% | 2.1%
13. Cell Systems | 167 papers in training set | Top 6% | 2.1%
14. Nature Methods | 336 papers in training set | Top 4% | 2.1%
15. Genome Research | 409 papers in training set | Top 2% | 2.1%
16. Scientific Reports | 3102 papers in training set | Top 53% | 1.9%
17. GigaScience | 172 papers in training set | Top 1% | 1.7%
18. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 3% | 1.7%
19. Patterns | 70 papers in training set | Top 0.9% | 1.7%
20. NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.7%
21. iScience | 1063 papers in training set | Top 16% | 1.7%
22. Nature Computational Science | 50 papers in training set | Top 0.7% | 1.5%
23. PLOS ONE | 4510 papers in training set | Top 58% | 1.3%
24. Virus Evolution | 140 papers in training set | Top 0.9% | 1.3%
25. Journal of Chemical Information and Modeling | 207 papers in training set | Top 2% | 1.2%
26. Communications Biology | 886 papers in training set | Top 14% | 1.2%
27. Viruses | 318 papers in training set | Top 6% | 0.7%
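
The 50% cutoff stated for the top 8 journals can be checked by accumulating the listed percentages in rank order. The short sketch below does this for the first ten entries; the function name is made up for the example, and the inputs are the rounded values shown in the list, so the total is approximate.

```python
# Predicted probabilities (in percent) for ranks 1-10, copied from the list.
probabilities = [10.4, 10.1, 7.2, 6.4, 4.8, 4.3, 3.6, 3.6, 3.1, 2.7]


def journals_to_reach(mass, probs):
    """Return (rank, cumulative total) at which the running sum first reaches mass."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= mass:
            return rank, total
    return None  # the listed values never reach the requested mass


if __name__ == "__main__":
    print(journals_to_reach(50.0, probabilities))
```

With these values the running sum is about 46.8% after seven journals and about 50.4% after eight, which matches the stated cutoff.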