Predicting COVID-19 incidence from seroprevalence and population-based cohort data using interpretable machine learning with differential privacy analysis

Krepel, J.; Binkyte, R.; Kerkouche, R.; Harries, M.; Klett-Tammen, C. J.; Fritz, M.; Kesselheim, S.; Kuehn, M.; Bazarova, A.; Lange, B.

2026-04-02 epidemiology

10.64898/2026.04.01.26349876 medRxiv

During the COVID-19 pandemic, reported incidence data played a central role in public health surveillance and in tracking epidemic dynamics, although they provide limited insight into the behavioral, immunological, and socioeconomic drivers of transmission. Population-based seroprevalence studies with linked survey data offer a rich but untapped source of individual-level information that can complement routine surveillance. In this study, we investigate whether aggregated seroprevalence cohort data can be leveraged to predict local COVID-19 incidence and to identify interpretable predictors associated with transmission dynamics. Using data from the Multilocal SeroPrevalence (MuSPAD) study in Germany (2020-2022), we trained multiple machine learning models, including least absolute shrinkage and selection operator (LASSO), vector autoregressive models (VAR), multilayer perceptrons (MLPs), and long short-term memory neural networks (LSTMs), to predict location-specific seven-day incidence rates. Feature importance was assessed using regression coefficients where applicable and model-agnostic explainability methods, including Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). Across model classes, cohort-derived features enabled accurate prediction of local incidence, with time-aware models achieving the strongest performance. Consistent predictors included prior infection and testing history, employment-related changes, vaccination status, and mask-wearing behavior, highlighting the importance of behavioral and reporting-related signals. While differential privacy introduced modest degradation in predictive performance under strict privacy budgets, SHAP-based explanations remained stable, and LIME-based explanations were more sensitive to privacy-induced noise. These results demonstrate that aggregated cohort data encode meaningful and interpretable signals of population-level transmission dynamics.
Population-based serosurveys therefore provide a complementary source of information for predicting local COVID-19 incidence and identifying key drivers of transmission beyond routine surveillance data. Our findings show that integrating interpretable machine learning with privacy-aware analysis enables actionable insights from sensitive cohort data, supporting their use in digital epidemiology and informing data-driven public health decision-making.
