KmerSignificance Score: A discriminative and biologically-informed framework for viral k-mer prioritization

Lebatteux, D.; Corso, F.; Soudeyns, H.; Boucoiran, I.; Gantt, S.; Banire Diallo, A.

2026-05-21 bioinformatics
10.64898/2026.05.15.725339 bioRxiv

Distinguishing closely related viral strains requires identifying genomic regions where subtle sequence differences carry biological significance. While k-mer-based approaches offer computational efficiency for genome analysis, existing methods lack standardized frameworks for evaluating which k-mers are most informative. Current selection strategies focus primarily on statistical discriminative power without integrating biological relevance. We introduce KmerSignificance Score (KSS), a k-mer prioritization framework combining three components: an information-theoretic method measuring strain-distinguishing capacity, an optimized amino acid substitution matrix (MIYATA EVO) for mutation impact assessment, and protein-level functional importance scoring derived from UniProt annotations. KSS produces standardized scores in the [0, 1] interval, enabling direct cross-dataset comparison. The discriminative component achieved classification performance comparable or superior to all tested alternatives (mean F1 = 0.880 vs. 0.718-0.877 for six established methods) while additionally providing bounded scores with consistent empirical distributions for cross-dataset comparability. MIYATA EVO, optimized via genetic algorithm, improved biophysical property correlations by 28.4% over the original MIYATA matrix. Protein scoring on 17,470 viral proteins showed robust agreement with UniProt annotation scores (Kendall's τ = 0.777) while revealing finer functional distinctions. Literature validation on SARS-CoV-2 (278,738 sequences, 19 variants), HIV-1 (12,223 sequences, 15 subtypes), and human cytomegalovirus (HCMV; 399-646 sequences, 4-8 genotypes) confirmed that high-scoring k-mers consistently map to established variant-defining mutations, subtype-specific polymorphisms, and genotype markers. KSS provides a standardized framework for viral k-mer prioritization with applications in variant surveillance, molecular epidemiology, and functional annotation.
The tool is available at https://github.com/bioinfoUQAM/KmerSignificanceScore.

Author summary

Identifying genetic differences between closely related viral strains is essential for pandemic preparedness, vaccine development, and understanding disease outbreaks. With millions of viral genomes now sequenced, researchers need tools that can rapidly pinpoint which genomic differences matter most biologically, not just which are statistically distinctive. Current k-mer-based approaches identify patterns that distinguish viral strains but cannot assess whether those differences affect protein function or disease phenotype. We developed KmerSignificance Score (KSS), a framework that ranks short genomic sequences by combining three types of evidence: how well they distinguish viral strains, how much the encoded amino acid changes affect protein properties, and how functionally important the affected protein is. The resulting scores are standardized on a 0-to-1 scale, allowing direct comparison across different viruses and studies. We validated the framework on three major human pathogens (SARS-CoV-2, HIV-1, and human cytomegalovirus) and found that top-scoring positions consistently correspond to sites with documented roles in immune evasion, drug resistance, viral fitness, and strain classification. The framework can help prioritize genomic features for surveillance of emerging variants, guide experimental validation, and support molecular epidemiology.
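
To make the first type of evidence concrete (how well a k-mer distinguishes strains), the following is a minimal sketch of one possible information-theoretic score bounded in [0, 1]. It is not the KSS discriminative component, whose formula is not given here. It assumes a simple stand-in: the mutual information between k-mer presence/absence and the strain label, divided by the label entropy. The function names `kmers`, `entropy`, and `discriminative_scores` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


def discriminative_scores(sequences, labels, k):
    """Illustrative score (an assumption, not the published method):
    I(presence; label) / H(label), clamped to [0, 1]."""
    n = len(sequences)
    label_counts = Counter(labels)
    h_label = entropy(label_counts.values())

    # For every k-mer, count how many sequences of each label contain it.
    present = {}
    for seq, lab in zip(sequences, labels):
        for km in kmers(seq, k):
            present.setdefault(km, Counter())[lab] += 1

    scores = {}
    for km, with_km in present.items():
        n_with = sum(with_km.values())
        n_without = n - n_with
        without_km = [label_counts[lab] - with_km[lab] for lab in label_counts]
        # Conditional entropy of the label given presence/absence of the k-mer.
        h_cond = 0.0
        if n_with:
            h_cond += n_with / n * entropy(with_km.values())
        if n_without:
            h_cond += n_without / n * entropy(without_km)
        raw = (h_label - h_cond) / h_label if h_label > 0 else 0.0
        scores[km] = min(1.0, max(0.0, raw))  # guard against rounding error
    return scores


# Toy data: two hypothetical strains, two sequences each.
toy_sequences = ["AAACCC", "AAACCG", "TTTCCC", "TTTCCG"]
toy_labels = ["A", "A", "B", "B"]
toy_scores = discriminative_scores(toy_sequences, toy_labels, k=3)
for km, score in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print(km, round(score, 3))
```

In this toy data, a k-mer found only in strain A (such as `AAA`) receives the maximum score, while one shared equally by both strains (such as `CCC`) receives zero. With more than two strains a single presence/absence feature cannot reach 1 under this normalization, which is one reason a real framework may normalize differently.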

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Genome Medicine (154 papers in training set): Top 0.5%, 10.0%
2. Cell Systems (167 papers in training set): Top 1%, 10.0%
3. Bioinformatics (1061 papers in training set): Top 3%, 8.3%
4. Briefings in Bioinformatics (326 papers in training set): Top 0.9%, 6.3%
5. Nature Communications (4913 papers in training set): Top 30%, 6.2%
6. Virus Evolution (140 papers in training set): Top 0.3%, 4.8%
7. BMC Bioinformatics (383 papers in training set): Top 2%, 3.9%
8. Nucleic Acids Research (1128 papers in training set): Top 5%, 3.9%
50% of probability mass above
9. PLOS Computational Biology (1633 papers in training set): Top 9%, 3.9%
10. Genome Biology (555 papers in training set): Top 2%, 3.5%
11. Bioinformatics Advances (184 papers in training set): Top 2%, 3.0%
12. Nature Biotechnology (147 papers in training set): Top 3%, 3.0%
13. Nature Methods (336 papers in training set): Top 4%, 2.1%
14. Cell Reports Methods (141 papers in training set): Top 2%, 1.9%
15. Patterns (70 papers in training set): Top 0.7%, 1.9%
16. NAR Genomics and Bioinformatics (214 papers in training set): Top 2%, 1.7%
17. PLOS ONE (4510 papers in training set): Top 55%, 1.6%
18. Viruses (318 papers in training set): Top 3%, 1.6%
19. GigaScience (172 papers in training set): Top 2%, 1.3%
20. Proceedings of the National Academy of Sciences (2130 papers in training set): Top 37%, 1.3%
21. Scientific Reports (3102 papers in training set): Top 66%, 1.2%
22. Frontiers in Bioinformatics (45 papers in training set): Top 0.7%, 0.9%
23. Journal of Chemical Information and Modeling (207 papers in training set): Top 3%, 0.8%
24. eLife (5422 papers in training set): Top 60%, 0.7%
25. Cell Host & Microbe (113 papers in training set): Top 6%, 0.6%
26. mSphere (281 papers in training set): Top 7%, 0.6%
27. Microbial Genomics (204 papers in training set): Top 3%, 0.6%
28. Computers in Biology and Medicine (120 papers in training set): Top 6%, 0.6%
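
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked directly from the listed percentages. The sketch below hard-codes the final percentage of each of the 28 entries in rank order; the helper name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1 to 28, copied from the list above.
probabilities = [
    10.0, 10.0, 8.3, 6.3, 6.2, 4.8, 3.9, 3.9, 3.9, 3.5,
    3.0, 3.0, 2.1, 1.9, 1.9, 1.7, 1.6, 1.6, 1.3, 1.3,
    1.2, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6,
]


def journals_to_reach(probs, threshold):
    """Smallest number of top-ranked entries whose cumulative probability
    reaches the threshold, or None if it is never reached."""
    for rank, cumulative in enumerate(accumulate(probs), start=1):
        if cumulative >= threshold:
            return rank
    return None


print(journals_to_reach(probabilities, 50.0))  # prints 8
```

The first seven entries sum to 49.5%, so the eighth is the first to push the cumulative total past 50%.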