Predictive Modeling of COVID-19 Variant Peak Prevalence and Duration Using GISAID Data Across 15 Countries
Zhang, Y.; Rob, P.; Chen, K.; Overton, C. E.; Jung, J.; Jo, Y.
Background: Rapid emergence and replacement of SARS-CoV-2 variants underscore the need for early and reliable indicators of variant dominance to guide timely public health response. However, early genomic trajectories are typically short, sparse, and noisy, with strong fluctuations and substantial cross-country heterogeneity in sequencing intensity and reporting.
Methods: We develop a scalable forecasting framework that predicts, from a new variant's initial genomic growth pattern, whether it will reach high prevalence and how long it will persist. Using more than nine million sequences from 15 countries (GISAID, 2020-2024), we characterize dominance through peak prevalence and duration above 10% and extract early growth descriptors from the first 2-4 weeks after a lineage surpasses 1% frequency. Outcomes were classified using multiple models (GLM, GAM, SVM, CART, Elastic Net, and Super Learner). We evaluated performance by accuracy and used SHAP analysis to interpret feature importance.
Results: The Super Learner ensemble performed best, reaching up to 0.76 accuracy for peak-share prediction and up to 0.70 accuracy for duration classification, substantially outperforming all individual models. SHAP analysis showed that variants achieving high peaks exhibit strong but structurally coherent early growth, whereas prolonged dominance is associated not with early surges but with sustained, moderate short-term fluctuations embedded within a stable trajectory.
Conclusion: This framework defines minimum surveillance thresholds (≥100 sequences in 30 days, ≥1% detection share), variant grouping rules, and noise-filtering protocols, enabling cross-country comparison and country-specific forecasting. It provides a lightweight, reproducible early-warning tool for genomic surveillance and real-time epidemic intelligence.
Significance: Identifying emerging SARS-CoV-2 variants capable of driving new surges is critical for global preparedness but remains challenging due to sparse early data. We present a machine learning framework that forecasts variant dominance using only the first 2-4 weeks of genomic growth. Analyzing nine million sequences across 15 countries, we reveal two distinct epidemiological signatures: high peak prevalence is driven by explosive, coherent early expansion, while long-term persistence is predicted by sustained, moderate fluctuations rather than initial speed. By establishing minimum surveillance thresholds, this work delivers a scalable, data-efficient early-warning tool that links early genomic signatures of viral fitness to downstream population-level dominance, achieving high predictive accuracy with a minimal number of sequences.
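The outcome and early-window definitions in the abstract (peak prevalence, duration above 10%, and the first 2-4 weeks after a lineage surpasses 1% frequency) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The weekly binning, the function names, the toy counts, and the two descriptors computed here (mean and spread of week-over-week log growth) are all assumptions made for illustration, since the abstract does not list the actual features.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def weekly_share(variant: Sequence[int], total: Sequence[int]) -> List[float]:
    """Variant share of all sequences per week (0.0 when nothing was sequenced)."""
    return [v / t if t > 0 else 0.0 for v, t in zip(variant, total)]


def dominance_outcomes(share: Sequence[float], high: float = 0.10) -> Tuple[float, int]:
    """Peak prevalence and the number of weeks with share strictly above `high`."""
    return max(share), sum(1 for s in share if s > high)


def early_window(share: Sequence[float], onset: float = 0.01, weeks: int = 4) -> List[float]:
    """First `weeks` observations starting at the first week the share exceeds `onset`."""
    for i, s in enumerate(share):
        if s > onset:
            return list(share[i:i + weeks])
    return []  # the lineage never crossed the onset threshold


def early_descriptors(window: Sequence[float]) -> Optional[Dict[str, float]]:
    """Hypothetical descriptors: mean and population SD of week-over-week log growth."""
    rates = [math.log(b / a) for a, b in zip(window, window[1:]) if a > 0 and b > 0]
    if not rates:
        return None
    mean = sum(rates) / len(rates)
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
    return {"mean_log_growth": mean, "log_growth_sd": sd}


# Toy weekly counts for one lineage in one country (invented, not GISAID data).
variant_counts = [0, 2, 5, 20, 60, 150, 200, 120, 30, 5]
total_counts = [400] * 10

share = weekly_share(variant_counts, total_counts)
peak, weeks_above = dominance_outcomes(share)
window = early_window(share)
descriptors = early_descriptors(window)
print(peak, weeks_above, window, descriptors)
```

In this toy series the early window begins in the third week, the first week in which the share exceeds 1%. Descriptors computed from such a window would be the inputs to a classifier, and the peak and duration values would be thresholded into the class labels. The abstract does not state those class cut-offs, so they are left out here.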
Matching journals
The top 8 journals account for 50% of the predicted probability mass.