An Improved Dataset for Predicting Mammal Infecting Viruses from Genetic Sequence Information
Reddy, T.; Schneider, A.; Hall, A. R.; Witmer, A.; Hengartner, N.
There have been several attempts to develop machine learning (ML) models that identify human-infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic because they are typically trained and evaluated on different datasets, with alternative data splitting schemes, features, and performance metrics. In this paper we present a standardized dataset of mammal-infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and adding two new host target labels, primate and mammal. The new host labels were included for several reasons, including previous reports that classification performance is better at broader taxonomic ranks, and the idea that more data may be available for primate infection, which might serve as a suitable proxy for zoonotic potential and help avoid false positives for human infection that are due to absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training and test sets, rather than using the original training/test assignments of Mollentze et al., increases the overall average ROC AUC for prediction of human infection from 0.663 ± 0.070 to 0.784 ± 0.013, consistent with the reduction in phylogenetic distance between training and test sets (relative entropy change from 3.00 to 0.08). The broadest host category, mammal infection, can be predicted most reliably, at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods for predicting human host infection.
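As a rough illustration of how a relative entropy between training and test sets might be computed, the sketch below takes the Kullback-Leibler divergence between the viral family frequency distributions of the two sets. This is an assumption made for illustration only: the abstract does not state which distribution, direction, logarithm base, or smoothing the authors used, and the function name `family_relative_entropy` and the `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter


def family_relative_entropy(train_families, test_families, pseudocount=1e-9):
    """KL(test || train), in bits, over viral family frequencies.

    Assumed definition for illustration; not taken from the paper.
    A small pseudocount keeps the value finite when a family is
    present in one set but absent from the other.
    """
    families = sorted(set(train_families) | set(test_families))
    train_counts = Counter(train_families)
    test_counts = Counter(test_families)
    # Normalizers include the pseudocount added to every family.
    train_total = len(train_families) + pseudocount * len(families)
    test_total = len(test_families) + pseudocount * len(families)

    kl = 0.0
    for fam in families:
        p = (test_counts[fam] + pseudocount) / test_total
        q = (train_counts[fam] + pseudocount) / train_total
        kl += p * math.log2(p / q)
    return kl


# Toy labels (hypothetical): matching family proportions vs. disjoint families.
similar = family_relative_entropy(["A", "A", "B", "B"], ["A", "B"])
disjoint = family_relative_entropy(["A"] * 4, ["B"] * 4)
```

Under this definition, a random split that preserves family proportions gives a value near zero, while a split with no shared families gives a large value whose magnitude depends on the chosen pseudocount.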
Overall, we have presented preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks, that, unsurprisingly, reducing the phylogenetic distance between training and test sets can improve predictive performance, and that peptide kmer features appear to be harmful to out-of-sample model performance. We are left with the question of whether models for virus host prediction can reasonably be expected to perform well in out-of-sample scenarios, given the likelihood that viruses do not share a common ancestor. Consistent with this concern, when the data are resampled such that there is no overlap between viral families in the training and test sets (relative entropy > 24), models perform no better than random chance at prediction of human infection, regardless of whether kmers are included (ROC AUC 0.50 ± 0.08) or not (ROC AUC 0.50 ± 0.04).
Author Summary
Determining whether a virus can infect a human or other animal based on its genetic information is useful for assessing the threat level of circulating and newly emerging viruses. Previous studies in this domain have had access to limited datasets, and in this work we nearly double the amount of manually labelled host data for viral infection, so that others may build on it and improve it further. We use machine learning models to rank the likelihood of human and mammal infection for viruses in this improved dataset. Results are consistent with the determination of host infection being more tractable for broader categories of hosts, like mammals, than for specific species, like humans. This may suggest that the prospects are good for improved future models that first screen viruses based on their likelihood of infecting mammals, and then, in a second stage, on their likelihood of infecting humans.
The most challenging scenarios were for predictions of viruses that were not similar to viruses in the training data, and the question remains whether we can expect reasonable generalization of predictive models to completely new viruses given that, at the time of writing, viruses do not appear to share a common ancestor.
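The family-disjoint resampling mentioned above can be sketched as a split that assigns whole viral families to either the training or the test set. This is a minimal sketch under stated assumptions, not the authors' procedure: the record layout (a dict with a `family` key), the function name `family_disjoint_split`, and the `test_fraction` and `seed` parameters are all hypothetical.

```python
import random


def family_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split records so that no viral family appears in both sets.

    `records` is assumed to be a list of dicts with a "family" key.
    Whole families, not individual viruses, are assigned to the test set.
    """
    families = sorted({r["family"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(families)
    # Hold out at least one family, rounding to the requested fraction.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train = [r for r in records if r["family"] not in test_families]
    test = [r for r in records if r["family"] in test_families]
    return train, test


# Toy records (hypothetical): five families with two viruses each.
records = [{"name": f"v{i}", "family": fam} for i, fam in enumerate("AABBCCDDEE")]
train, test = family_disjoint_split(records, test_fraction=0.2, seed=0)
```

Splitting by family rather than by virus is what removes close relatives of test viruses from training. Libraries such as scikit-learn offer group-aware splitters (for example `GroupShuffleSplit`) that serve a similar purpose.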