Basic Baseline model design choices can substantially influence performance in collaborative forecast hubs
Suez, E.; Fox, S. J.
Over the past decade, outbreak forecasting has become an increasingly used tool to assist public health decision-making during epidemics. Collaborative forecast hubs, where multiple teams submit predictions in real-time, are the gold standard for such efforts. For each hub, a Baseline model is used as a performance benchmark for other models. Although the Baseline is understood as a naive forecast, its design is subjective, and the impact of model design decisions remains understudied. We evaluated how three specification decisions influence the performance of trend-based Baseline models, which forecast from recently observed dynamics: (1) the amount of historical data used for training, (2) whether the data are transformed, and (3) whether forecasts follow a flatline variant (constant predictions) or a drift variant (allowing a slope). Retrospective forecasts were generated for multiple years across four surveillance targets: COVID-19, influenza, and RSV hospital admissions, and weighted influenza-like illness percentage (wILI). For wILI, we additionally compared trend baselines with a seasonal baseline model leveraging long-term historical patterns. Model specification significantly altered performance. The best-performing model across targets was a flatline model that used the most recent 6-12 transformed observations. This model outperformed the current standard Baseline used in many forecast hubs by an average of 9.6% (range: 3.7-12.9%) across forecast targets, and it outperformed the seasonal baseline model by 32.3% across nine influenza seasons. Our results demonstrate that subjective Baseline design decisions can materially influence forecast accuracy and, consequently, the perceived rankings of models within collaborative forecast hubs.
Based on the varying approaches and their performance differences, these findings highlight the need for increased transparency in Baseline model specifications and support the routine inclusion of multiple benchmark models within collaborative forecast hubs.
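The three design decisions examined in the abstract (training window, data transformation, flatline vs. drift) can be illustrated with a minimal sketch. This is not the authors' code; function and parameter names are hypothetical, and only point forecasts are produced (hub baselines additionally quantify uncertainty, e.g., via resampled historical differences):

```python
import numpy as np

def baseline_forecast(obs, horizon, window=8, transform=True, drift=False):
    """Naive baseline forecast from the last `window` observations.

    Hypothetical sketch of the design choices described in the abstract:
    window size, log-transform, and flatline vs. drift extrapolation.
    """
    y = np.asarray(obs[-window:], dtype=float)
    if transform:
        y = np.log1p(y)  # log(1 + x) stabilizes variance for count data
    last = y[-1]
    if drift:
        # Drift: extrapolate the average per-step change over the window
        slope = (y[-1] - y[0]) / (len(y) - 1)
        fc = last + slope * np.arange(1, horizon + 1)
    else:
        # Flatline: repeat the most recent observation
        fc = np.full(horizon, last)
    if transform:
        fc = np.expm1(fc)  # back-transform to the original scale
    return np.maximum(fc, 0.0)  # admissions cannot be negative

# Example: weekly hospital admissions (illustrative numbers)
hosp = [120, 135, 150, 170, 160, 180, 200, 210]
flat = baseline_forecast(hosp, horizon=4, drift=False)
trend = baseline_forecast(hosp, horizon=4, drift=True)
```

Under this sketch, the flatline variant repeats the last observed value (210) at every horizon, while the drift variant projects the recent upward trend forward.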