Predictive Modeling of COVID-19 Variant Peak Prevalence and Duration Using GISAID Data Across 15 Countries

Zhang, Y.; Rob, P.; Chen, K.; Overton, C. E.; Jung, J.; Jo, Y.

2026-02-05 infectious diseases
10.64898/2026.02.04.26345559 medRxiv

Background: Rapid emergence and replacement of SARS-CoV-2 variants underscore the need for early and reliable indicators of variant dominance to guide timely public health response. However, early genomic trajectories are typically short, sparse, and noisy, with strong fluctuations and substantial cross-country heterogeneity in sequencing intensity and reporting.

Methods: We develop a scalable forecasting framework that predicts whether new variants will reach high prevalence and how long they will persist, based on their initial genomic growth patterns. Using more than nine million sequences from 15 countries (GISAID, 2020-2024), we characterize dominance through peak prevalence and duration above 10% prevalence, and extract early growth descriptors from the first 2-4 weeks after a lineage surpasses 1% frequency. Outcomes were classified using multiple models (GLM, GAM, SVM, CART, Elastic Net, and Super Learner). We evaluated performance by accuracy and used SHAP analysis to interpret feature importance.

Results: The Super Learner ensemble performed best, reaching up to 0.76 accuracy for peak-share prediction and up to 0.70 accuracy for duration classification, substantially outperforming all individual models. SHAP analysis showed that variants achieving high peaks exhibit strong but structurally coherent early growth, whereas prolonged dominance is associated not with early surges but with sustained, moderate short-term fluctuations embedded within a stable trajectory.

Conclusion: This framework defines minimum surveillance thresholds (≥100 sequences in 30 days, ≥1% detection share), variant grouping rules, and noise-filtering protocols, enabling cross-country comparison and country-specific forecasting. It provides a lightweight, reproducible early-warning tool for genomic surveillance and real-time epidemic intelligence.

Significance: Identifying emerging SARS-CoV-2 variants capable of driving new surges is critical for global preparedness but remains challenging due to sparse early data. We present a machine learning framework that forecasts variant dominance using only the first 2-4 weeks of genomic growth. Analyzing nine million sequences across 15 countries, we reveal two distinct epidemiological signatures: high peak prevalence is driven by explosive, coherent early expansion, while long-term persistence is predicted by sustained, moderate fluctuations rather than initial speed. By establishing minimum surveillance thresholds, this work delivers a scalable, data-efficient early-warning tool that links early genomic signatures of viral fitness to downstream population-level dominance, achieving high predictive accuracy with a minimal number of sequences.
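
The outcome definitions in the abstract (peak prevalence, duration above 10%, and an early window that opens once a lineage surpasses 1% frequency) can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `summarize_lineage`, the weekly time step, counting duration as the total number of weeks above 10%, and the two growth descriptors (mean and spread of week-to-week log growth) are all assumptions made for illustration; the abstract does not say which descriptors were actually used.

```python
import math
from statistics import mean, pstdev


def summarize_lineage(weekly_share, entry=0.01, dominance=0.10, window=4):
    """Summarize one lineage's weekly prevalence shares (fractions in [0, 1]).

    Hypothetical helper: thresholds follow the abstract (1% entry, 10%
    dominance); the growth descriptors are illustrative assumptions.
    """
    # First week in which the lineage exceeds the entry threshold.
    start = next((i for i, s in enumerate(weekly_share) if s > entry), None)
    if start is None:
        return None  # the lineage never surpassed the entry threshold

    # Early window: up to `window` weeks starting at the crossing week.
    early = weekly_share[start:start + window]

    # Week-to-week log growth inside the early window (assumed descriptor).
    # Pairs containing a zero share are skipped to avoid log(0).
    log_growth = [
        math.log(b / a)
        for a, b in zip(early, early[1:])
        if a > 0 and b > 0
    ]

    return {
        "peak_share": max(weekly_share),
        # Assumption: duration is the total count of weeks above 10%.
        "weeks_above_dominance": sum(s > dominance for s in weekly_share),
        "early_window": early,
        "mean_log_growth": mean(log_growth) if log_growth else 0.0,
        "growth_volatility": pstdev(log_growth) if len(log_growth) > 1 else 0.0,
    }


# Made-up trajectory for illustration only (not GISAID data).
example = [0.002, 0.008, 0.02, 0.05, 0.12, 0.30, 0.45, 0.38, 0.15, 0.06]
summary = summarize_lineage(example)
```

For the made-up trajectory, the early window starts in the third week, the first to exceed 1%. In the paper's setting, summaries of this kind would serve as inputs (early descriptors) and labels (peak, duration) for the classifiers.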

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set; Top 10%; 14.5%
2. Nature Medicine: 117 papers in training set; Top 0.3%; 6.3%
3. The Lancet Infectious Diseases: 71 papers in training set; Top 0.4%; 6.3%
4. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 12%; 6.2%
5. Communications Medicine: 85 papers in training set; Top 0.1%; 4.8%
6. Nature Computational Science: 50 papers in training set; Top 0.1%; 4.2%
7. Science Translational Medicine: 111 papers in training set; Top 0.6%; 4.1%
8. PLOS Computational Biology: 1633 papers in training set; Top 9%; 3.9%

50% of probability mass above

9. eBioMedicine: 130 papers in training set; Top 0.4%; 3.5%
10. Patterns: 70 papers in training set; Top 0.4%; 2.4%
11. Scientific Reports: 3102 papers in training set; Top 54%; 1.9%
12. BMC Medicine: 163 papers in training set; Top 4%; 1.7%
13. Genome Medicine: 154 papers in training set; Top 5%; 1.6%
14. Nature Biotechnology: 147 papers in training set; Top 5%; 1.6%
15. Virus Evolution: 140 papers in training set; Top 0.9%; 1.5%
16. Science: 429 papers in training set; Top 15%; 1.5%
17. Clinical Infectious Diseases: 231 papers in training set; Top 3%; 1.5%
18. PNAS Nexus: 147 papers in training set; Top 0.4%; 1.5%
19. PLOS ONE: 4510 papers in training set; Top 59%; 1.3%
20. Nature: 575 papers in training set; Top 12%; 1.3%
21. Med: 38 papers in training set; Top 0.5%; 1.2%
22. Cell Reports Medicine: 140 papers in training set; Top 5%; 1.2%
23. npj Digital Medicine: 97 papers in training set; Top 3%; 1.2%
24. Cell Reports Methods: 141 papers in training set; Top 4%; 0.9%
25. BMC Infectious Diseases: 118 papers in training set; Top 5%; 0.9%
26. Nature Genetics: 240 papers in training set; Top 7%; 0.9%
27. PLOS Biology: 408 papers in training set; Top 18%; 0.8%
28. Briefings in Bioinformatics: 326 papers in training set; Top 7%; 0.7%
29. Cell Systems: 167 papers in training set; Top 12%; 0.7%
30. eLife: 5422 papers in training set; Top 59%; 0.7%
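
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below does that for the first eight values from the list; the function name `journals_to_reach` is a hypothetical helper, not part of the source page.

```python
# Predicted probabilities (%) for the eight highest-ranked journals,
# copied from the list above.
top_probabilities = [14.5, 6.3, 6.3, 6.2, 4.8, 4.2, 4.1, 3.9]


def journals_to_reach(probabilities, target=50.0):
    """Return how many top-ranked entries are needed for the running
    total to reach `target` percent, or None if it is never reached."""
    running = 0.0
    for count, p in enumerate(probabilities, start=1):
        running += p
        if running >= target:
            return count
    return None


cutoff = journals_to_reach(top_probabilities)
```

The running total is about 46.4% after seven journals and about 50.3% after eight, which matches the stated cutoff.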