Preparing For the Next Pandemic: Learning Wild Mutational Patterns At Scale For Analyzing Sequence Divergence In Novel Pathogens

Li, J.; Li, T.; Chattopadhyay, I.

2020-07-19 infectious diseases
10.1101/2020.07.17.20156364 medRxiv
As we begin to recover from the COVID-19 pandemic, a key question is whether we can avert such disasters in the future. Current surveillance protocols generally focus on qualitative impact assessments of viral diversity [1]. These efforts are primarily aimed at ecosystem and human impact monitoring, and do not help to precisely quantify emergence. Currently, the similarity of biological strains is measured by the edit distance, or the number of mutations that separate their genomic sequences [2-6], e.g. the number of mutations that make an avian flu strain human-adapted. However, ignoring the odds of those mutations in the wild keeps us blind to the true jump risk, and gives us little indication of which strains are riskier. In this study, we develop a more meaningful metric for comparing genomic sequences. Our metric, the q-distance, precisely quantifies the probability of a spontaneous jump by random chance. Learning from patterns of mutations in large sequence databases, the q-distance adapts to the specific organism, the background population, and realistic selection pressures, demonstrably improving inference of ancestral relationships and future trajectories. As an important application, we show that the q-distance predicts future strains for seasonal Influenza, outperforming the World Health Organization (WHO) recommended flu-shot composition almost consistently over two decades. Such performance is demonstrated separately for the Northern and Southern hemispheres, for different subtypes, and for key capsidic proteins. Additionally, we investigate the SARS-CoV-2 origin problem and precisely quantify the likelihood that different animal species hosted an immediate progenitor, producing a list of related species of bats that have a quantifiably high likelihood of being the source. We also identify specific rodents with a credible likelihood of hosting a SARS-CoV-2 ancestor.
Combining machine learning and large deviation theory, the analysis reported here may open the door to actionable predictions of future pandemics.
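
For readers unfamiliar with the baseline the abstract argues against, the following is a minimal sketch of a plain edit distance (Levenshtein distance) between two sequences. It is not the q-distance, whose definition is not given in this abstract; the function name `edit_distance` and the toy sequences are illustrative assumptions. Note that every substitution, insertion, or deletion costs exactly 1 here, regardless of how likely that mutation is in the wild, which is the limitation the abstract points to.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions, or deletions turning `a` into `b`.

    Hypothetical helper for illustration only; all edits have unit cost,
    so the odds of each mutation are ignored.
    """
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Toy sequences (made up, not real genomic data).
d = edit_distance("ACGT", "AGGT")
```

Two pairs of strains at the same edit distance are indistinguishable under this measure, even if one set of mutations is far more probable than the other.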

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Nature Computational Science (50 papers in training set; Top 0.1%; 22.1%)
2. Scientific Reports (3102 papers in training set; Top 7%; 9.9%)
3. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 6%; 9.9%)
4. Nature Communications (4913 papers in training set; Top 19%; 9.9%)

50% of probability mass above

5. Cell Reports Methods (141 papers in training set; Top 1.0%; 3.5%)
6. Nature Medicine (117 papers in training set; Top 0.9%; 3.5%)
7. iScience (1063 papers in training set; Top 5%; 3.5%)
8. Nature Biotechnology (147 papers in training set; Top 3%; 3.5%)
9. eLife (5422 papers in training set; Top 31%; 2.7%)
10. Communications Biology (886 papers in training set; Top 4%; 2.6%)
11. Nature (575 papers in training set; Top 9%; 2.3%)
12. Virus Evolution (140 papers in training set; Top 0.8%; 1.7%)
13. Science Advances (1098 papers in training set; Top 21%; 1.5%)
14. Molecular Biology and Evolution (488 papers in training set; Top 3%; 1.3%)
15. Cell Reports (1338 papers in training set; Top 29%; 1.2%)
16. Cell Systems (167 papers in training set; Top 10%; 0.9%)
17. Communications Medicine (85 papers in training set; Top 0.8%; 0.9%)
18. Patterns (70 papers in training set; Top 2%; 0.9%)
19. The Lancet Microbe (43 papers in training set; Top 1%; 0.7%)
20. Emerging Infectious Diseases (103 papers in training set; Top 3%; 0.7%)
21. Nature Methods (336 papers in training set; Top 7%; 0.7%)
22. Science Translational Medicine (111 papers in training set; Top 7%; 0.7%)
23. Journal of Chemical Information and Modeling (207 papers in training set; Top 3%; 0.6%)
24. PNAS Nexus (147 papers in training set; Top 3%; 0.6%)
25. Nano Letters (63 papers in training set; Top 3%; 0.6%)
26. npj Systems Biology and Applications (99 papers in training set; Top 3%; 0.6%)
27. Peer Community Journal (254 papers in training set; Top 5%; 0.6%)
28. NAR Genomics and Bioinformatics (214 papers in training set; Top 4%; 0.6%)
29. Computers in Biology and Medicine (120 papers in training set; Top 6%; 0.6%)