
An Improved Dataset for Predicting Mammal Infecting Viruses from Genetic Sequence Information

Reddy, T.; Schneider, A.; Hall, A. R.; Witmer, A.; Hengartner, N.

2026-01-25 bioinformatics
10.1101/2025.09.17.676952 bioRxiv
Abstract

There have been several attempts to develop machine learning (ML) models that identify human-infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic because they are typically trained and evaluated on different datasets, with different data-splitting schemes, features, and performance metrics. In this paper we present a standardized dataset of mammal-infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to incorporate the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and to add new host target labels, primate and mammal. These labels were included for several reasons: previous reports that classification performance is better at broader taxonomic ranks; the idea that the larger volume of primate-infection data may serve as a suitable proxy for zoonotic potential; and the avoidance of false positives for human infection that arise from absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training/testing sets, compared with the original training/testing assignments of Mollentze et al., increases the overall average ROC AUC for prediction of human infection from 0.663 ± 0.070 to 0.784 ± 0.013, consistent with the reduction in phylogenetic distance between train and test sets (relative entropy change from 3.00 to 0.08). The broadest host category, mammal infection, can be predicted most reliably, at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods for predicting human host infections.
Overall, we present preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks; that, unsurprisingly, reducing the phylogenetic distance between training and test sets improves predictive performance; and that peptide k-mer features appear to harm out-of-sample model performance. We are left with the question of whether models for virus host prediction can reasonably be expected to perform well out of sample, given the likelihood that viruses do not share a common ancestor. Consistent with this concern, when the data is resampled so that no viral family is shared between training and test sets (relative entropy > 24), models perform no better than random chance at predicting human infection, regardless of whether k-mers are included (ROC AUC 0.50 ± 0.08) or not (ROC AUC 0.50 ± 0.04).

Author Summary

Determining whether a virus can infect a human or other animal based on its genetic information is useful for assessing the threat level of circulating and newly emerging viruses. Previous studies in this domain have had access to limited datasets; in this work we nearly double the amount of manually labelled host data for viral infection, so that others may build on and improve it further. We use machine learning models to rank the likelihood of human and mammal infection for viruses in this improved dataset. Results are consistent with the determination of host infection being more tractable for broader categories of hosts, like mammals, than for specific species, like humans. This suggests good prospects for future models that first screen viruses for their likelihood of infecting mammals, and then, in a second stage, for their likelihood of infecting humans.
The most challenging scenarios were predictions for viruses dissimilar to those in the training data, and the question remains whether we can expect predictive models to generalize reasonably to completely new viruses given that, at the time of writing, viruses do not appear to share a common ancestor.
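The family-disjoint resampling and train/test relative-entropy diagnostic described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the epsilon smoothing for families absent from training, and the base-2 logarithm are all assumptions.

```python
import math
import random
from collections import Counter

def relative_entropy(test_families, train_families, eps=1e-6):
    """KL divergence D(test || train) between viral-family frequency
    distributions. Families absent from the training set are smoothed
    with a small epsilon (an illustrative choice), so a fully disjoint
    split yields a large but finite value."""
    p = Counter(test_families)
    q = Counter(train_families)
    n_p, n_q = len(test_families), len(train_families)
    kl = 0.0
    for fam, count in p.items():
        pf = count / n_p
        qf = max(q[fam] / n_q, eps)
        kl += pf * math.log2(pf / qf)
    return kl

def family_disjoint_split(viruses, families, test_frac=0.3, seed=0):
    """Assign whole viral families to train or test, so that no
    family appears on both sides of the split."""
    rng = random.Random(seed)
    fams = sorted(set(families))
    rng.shuffle(fams)
    n_test = max(1, round(len(fams) * test_frac))
    test_fams = set(fams[:n_test])
    train = [v for v, f in zip(viruses, families) if f not in test_fams]
    test = [v for v, f in zip(viruses, families) if f in test_fams]
    return train, test
```

A random per-virus split leaves the train and test family distributions nearly identical (relative entropy near zero), whereas holding out whole families drives the divergence up sharply, matching the pattern of scores reported above.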

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank  Journal                                                Papers in training set  Percentile  Probability
1     PLOS Computational Biology                             1633                    Top 3%      10.2%
2     BMC Bioinformatics                                     383                     Top 1%      8.5%
3     PLOS ONE                                               4510                    Top 22%     8.5%
4     Scientific Reports                                     3102                    Top 23%     4.9%
5     Virus Evolution                                        140                     Top 0.3%    4.9%
6     Bioinformatics Advances                                184                     Top 0.6%    4.9%
7     Frontiers in Bioinformatics                            45                      Top 0.1%    4.9%
8     Bioinformatics                                         1061                    Top 5%      4.4%
----- 50% of probability mass above this line -----
9     Viruses                                                318                     Top 1%      4.2%
10    PeerJ                                                  261                     Top 3%      3.3%
11    ImmunoInformatics                                      11                      Top 0.1%    2.6%
12    Biology Methods and Protocols                          53                      Top 0.8%    1.8%
13    Biology                                                43                      Top 0.9%    1.5%
14    Frontiers in Immunology                                586                     Top 5%      1.5%
15    GigaScience                                            172                     Top 2%      1.3%
16    Journal of Computational Biology                       37                      Top 0.3%    1.3%
17    F1000Research                                          79                      Top 3%      1.2%
18    Patterns                                               70                      Top 1%      1.2%
19    Frontiers in Genetics                                  197                     Top 7%      1.1%
20    Frontiers in Virology                                  15                      Top 0.1%    1.1%
21    JAIDS Journal of Acquired Immune Deficiency Syndromes  19                      Top 0.3%    0.9%
22    Briefings in Bioinformatics                            326                     Top 6%      0.9%
23    BMC Genomics                                           328                     Top 4%      0.9%
24    Journal of Virology                                    456                     Top 3%      0.8%
25    mSphere                                                281                     Top 6%      0.8%
26    Journal of Chemical Information and Modeling           207                     Top 3%      0.8%
27    PLOS Biology                                           408                     Top 21%     0.7%
28    Wellcome Open Research                                 57                      Top 3%      0.6%
29    mBio                                                   750                     Top 12%     0.6%
30    Journal of Theoretical Biology                         144                     Top 2%      0.6%