
CladePredictor-MPXV: An alignment-free Artificial Intelligence-based classifier of complete and partial mpox virus genomes

Sganzerla Martinez, G.; Toloue Ostadgavahi, A.; Dutt, M.; Maguire, F.; Pena-Castillo, L.; Kelvin, D. J.; Kumar, A.

2026-04-28 infectious diseases
10.64898/2026.04.27.26351821 medRxiv

Poxviruses constitute a threat to human health. Since 2022, two public health emergencies of international concern have been declared due to the global spread of mpox viruses (MPXVs). The emergence of the novel MPXV subclade Ib has placed the global health community on alert, as sustained human-to-human and travel-related transmission is prevalent in Africa and 30 non-African countries. Metagenomic and outbreak surveillance data often generate complete as well as partial genome assemblies, which then require efficient taxonomic classification. Traditional viral genome classifiers rely on alignment methods that scale poorly, creating computational bottlenecks in taxonomic classification. Here, we present CladePredictor-MPXV: an alignment-free AI-based classifier of complete and partial MPXV genomes. Our classification framework consists of an ensemble of XGBoost and CNNs that distinguishes among subclades Ia, Ib and IIb. CladePredictor-MPXV was trained with 3,866 MPXV genomes. XGBoost models were trained with 3-mers, which are representative of the global feature space of complete MPXV genomes. CNNs were trained with short-range, position-independent sequence patterns to assign clades to partial genomes with a minimum size of 1,000 nucleotides. Our XGBoost instance attained a weighted average accuracy of 90.2%, while our CNN instance attained a weighted average accuracy of 95%, in classifying clade (I vs II) and subclade (Ia vs Ib) from complete (>= 188,000 nucleotides) and partial MPXV genomes on a phylogenetically distinct validation set. CladePredictor-MPXV is freely available at https://clade-predictor.microbiologyandimmunology.dal.ca and provides a fast and efficient framework for assigning complete and partial MPXV genomes to subclades Ia, Ib, and IIb.
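To make the "3-mers" feature concrete, the following is a minimal sketch of how a genome sequence can be turned into an alignment-free 3-mer frequency vector. It is an illustration only, not the authors' pipeline: the function name `kmer_frequencies`, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are all assumptions, since the abstract does not describe these details.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of all A/C/G/T k-mers in lexicographic order.

    Hypothetical helper: normalization and ambiguity handling are assumptions,
    not details taken from the abstract.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    # Slide a window of length k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other ambiguity codes
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


features = kmer_frequencies("ACGTACGT")
print(len(features))  # one entry per possible 3-mer
```

For k = 3 the vector always has 4^3 = 64 entries regardless of genome length, which is what allows a fixed-size input to a tree-based model such as XGBoost without aligning the sequences first.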

Matching journals

The top 11 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Virus Evolution | 140 | Top 0.1% | 14.9%
2 | Nature Communications | 4913 | Top 28% | 6.4%
3 | Viruses | 318 | Top 1.0% | 4.9%
4 | Nature Medicine | 117 | Top 0.7% | 4.0%
5 | PLOS Pathogens | 721 | Top 3% | 3.7%
6 | Nature Microbiology | 133 | Top 1% | 3.7%
7 | Nature Computational Science | 50 | Top 0.1% | 3.6%
8 | PLOS Computational Biology | 1633 | Top 11% | 3.1%
9 | Scientific Reports | 3102 | Top 43% | 2.8%
10 | mBio | 750 | Top 6% | 2.6%
11 | Nature Biotechnology | 147 | Top 4% | 1.9%
50% of probability mass above
12 | mSphere | 281 | Top 3% | 1.8%
13 | Microbial Genomics | 204 | Top 1% | 1.7%
14 | The Lancet Infectious Diseases | 71 | Top 1% | 1.7%
15 | PLOS ONE | 4510 | Top 54% | 1.7%
16 | Proceedings of the National Academy of Sciences | 2130 | Top 35% | 1.5%
17 | Frontiers in Microbiology | 375 | Top 6% | 1.3%
18 | PLOS Biology | 408 | Top 13% | 1.2%
19 | Cell Host & Microbe | 113 | Top 4% | 1.1%
20 | Science Translational Medicine | 111 | Top 4% | 1.1%
21 | Journal of Infection | 71 | Top 2% | 1.0%
22 | Bioinformatics | 1061 | Top 8% | 1.0%
23 | Science | 429 | Top 18% | 0.9%
24 | Briefings in Bioinformatics | 326 | Top 6% | 0.9%
25 | Microbiome | 139 | Top 3% | 0.9%
26 | Communications Biology | 886 | Top 18% | 0.9%
27 | Cell | 370 | Top 16% | 0.8%
28 | Nature | 575 | Top 15% | 0.8%
29 | Epidemics | 104 | Top 2% | 0.8%
30 | The Lancet Microbe | 43 | Top 1% | 0.8%
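
The statement that the top 11 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is only an arithmetic check on the rounded values shown in the list; the function name `journals_to_reach` is hypothetical, and the page's own cutoff was presumably computed from unrounded probabilities.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1 to 12, as listed above.
listed_percentages = [14.9, 6.4, 4.9, 4.0, 3.7, 3.7, 3.6, 3.1, 2.8, 2.6, 1.9, 1.8]


def journals_to_reach(percentages, threshold=50.0):
    """Return the smallest rank whose cumulative percentage meets the threshold."""
    for rank, running_total in enumerate(accumulate(percentages), start=1):
        if running_total >= threshold:
            return rank
    return None  # threshold not reached within the listed entries


print(journals_to_reach(listed_percentages))  # 11
```

With the rounded values, ranks 1 to 10 sum to about 49.7% and adding rank 11 brings the total to about 51.6%, which agrees with the cutoff shown in the list.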