Predicting COVID-19 incidence from seroprevalence and population-based cohort data using interpretable machine learning with differential privacy analysis

Krepel, J.; Binkyte, R.; Kerkouche, R.; Harries, M.; Klett-Tammen, C. J.; Fritz, M.; Kesselheim, S.; Kuehn, M.; Bazarova, A.; Lange, B.

2026-04-02 epidemiology
10.64898/2026.04.01.26349876 medRxiv

During the COVID-19 pandemic, reported incidence data played a central role in public health surveillance and in tracking epidemic dynamics, although they provide limited insight into the behavioral, immunological, and socioeconomic drivers of transmission. Population-based seroprevalence studies with linked survey data offer a rich but untapped source of individual-level information that can complement routine surveillance. In this study, we investigate whether aggregated seroprevalence cohort data can be leveraged to predict local COVID-19 incidence and to identify interpretable predictors associated with transmission dynamics. Using data from the Multilocal SeroPrevalence (MuSPAD) study in Germany (2020–2022), we trained multiple machine learning models, including least absolute shrinkage and selection operator (LASSO), vector autoregressive models (VAR), multilayer perceptrons (MLPs), and long short-term memory neural networks (LSTMs), to predict location-specific seven-day incidence rates. Feature importance was assessed using regression coefficients where applicable and model-agnostic explainability methods, including Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). Across model classes, cohort-derived features enabled accurate prediction of local incidence, with time-aware models achieving the strongest performance. Consistent predictors included prior infection and testing history, employment-related changes, vaccination status, and mask-wearing behavior, highlighting the importance of behavioral and reporting-related signals. Differential privacy introduced modest degradation in predictive performance under strict privacy budgets; SHAP-based explanations remained stable, whereas LIME-based explanations were more sensitive to privacy-induced noise. These results demonstrate that aggregated cohort data encode meaningful and interpretable signals of population-level transmission dynamics.
Population-based serosurveys therefore provide a complementary source of information for predicting local COVID-19 incidence and identifying key drivers of transmission beyond routine surveillance data. Our findings show that integrating interpretable machine learning with privacy-aware analysis enables actionable insights from sensitive cohort data, supporting their use in digital epidemiology and informing data-driven public health decision-making.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set, Top 4%, 22.0%
2. Epidemics: 104 papers in training set, Top 0.1%, 9.9%
3. Scientific Reports: 3102 papers in training set, Top 20%, 6.2%
4. Science Advances: 1098 papers in training set, Top 1%, 6.2%
5. npj Digital Medicine: 97 papers in training set, Top 0.8%, 6.2%

50% of probability mass above

6. Nature Medicine: 117 papers in training set, Top 0.7%, 3.9%
7. Nature Human Behaviour: 85 papers in training set, Top 1%, 3.5%
8. Communications Medicine: 85 papers in training set, Top 0.1%, 3.5%
9. eLife: 5422 papers in training set, Top 27%, 3.5%
10. International Journal of Epidemiology: 74 papers in training set, Top 0.7%, 3.2%
11. PLOS Computational Biology: 1633 papers in training set, Top 11%, 3.0%
12. American Journal of Epidemiology: 57 papers in training set, Top 0.5%, 2.3%
13. The Lancet Digital Health: 25 papers in training set, Top 0.3%, 2.0%
14. PLOS ONE: 4510 papers in training set, Top 49%, 2.0%
15. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 28%, 2.0%
16. Science Translational Medicine: 111 papers in training set, Top 2%, 1.8%
17. Patterns: 70 papers in training set, Top 1%, 1.5%
18. Swiss Medical Weekly: 12 papers in training set, Top 0.3%, 0.8%
19. International Journal of Medical Informatics: 25 papers in training set, Top 2%, 0.7%
20. Philosophical Transactions of the Royal Society B: 51 papers in training set, Top 6%, 0.7%
21. Eurosurveillance: 80 papers in training set, Top 2%, 0.7%
22. PLOS Digital Health: 91 papers in training set, Top 3%, 0.7%
23. Journal of The Royal Society Interface: 189 papers in training set, Top 5%, 0.7%
24. BMC Medicine: 163 papers in training set, Top 8%, 0.7%
25. Science: 429 papers in training set, Top 22%, 0.6%
26. Med: 38 papers in training set, Top 1%, 0.6%