Simple cumulative weighting of routine surveillance data identifies epidemic wave origins more accurately than a large language model: evidence from eight COVID-19 waves in Japan

Nakagawa, S.; Yamamoto, A.

2026-06-03 public and global health
10.64898/2026.06.02.26354691 medRxiv

Identifying the origin of an emerging epidemic wave within days of onset could enable targeted response before national spread, yet current methods rely on genomic sequencing that lags clinical detection by 2-4 weeks. We analysed daily COVID-19 cases from Japan's 47 prefectures across eight waves (2020-2023), aggregated into 11 regional blocks. Wave onset was defined by the first difference of the K-value (K'). Six surveillance indicators were evaluated with and without cumulative historical weighting (λ = 0.75) and benchmarked against a large language model (Claude Haiku), scored by F1 against genomically confirmed origins. At 14 days after onset, cumulative weighting of peak and cumulative incidence (B1+prior, B3+prior) reached mean F1 = 0.622, exceeding the model (0.524); the gap was largest in Wave 7 (1.000 vs 0.333). Simple cumulative weighting of routine surveillance data identified wave origins more accurately than a language model, without proprietary tools or sequencing.
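The abstract does not spell out the weighting formula or how F1 is computed, so the following is only a minimal sketch under stated assumptions: the historical prior is taken to be an exponentially decayed sum of earlier waves' indicator scores (decay λ = 0.75), predicted origins are the top-k blocks by combined score, and F1 is computed on sets of blocks. The function names, block names, and scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set

LAMBDA = 0.75  # decay weight quoted in the abstract


def cumulative_weighted(waves: List[Dict[str, float]], lam: float = LAMBDA) -> Dict[str, float]:
    """Combine per-wave indicator scores, oldest wave first.

    ASSUMPTION: combined = current + lam * previous + lam**2 * (wave before) + ...
    The paper's exact formula may differ.
    """
    combined: Dict[str, float] = {}
    for scores in waves:
        blocks = set(combined) | set(scores)
        combined = {b: lam * combined.get(b, 0.0) + scores.get(b, 0.0) for b in blocks}
    return combined


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k highest-scoring blocks as the predicted origin set."""
    ranked = sorted(scores, key=lambda b: scores[b], reverse=True)
    return set(ranked[:k])


def f1(predicted: Set[str], truth: Set[str]) -> float:
    """Set-based F1: 2*TP / (|predicted| + |truth|), 0.0 if there is no overlap."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    return 2 * tp / (len(predicted) + len(truth))


# Illustrative scores only (not data from the paper).
history = [
    {"Kanto": 1.0, "Kansai": 0.0},  # earlier wave
    {"Kanto": 0.0, "Kansai": 0.5},  # current wave
]
combined = cumulative_weighted(history)
predicted = top_k(combined, k=1)
score = f1(predicted, {"Kanto"})
```

In this toy case the current wave alone would point to Kansai, while the decayed prior keeps Kanto on top. Under the set-based definition, naming five blocks when only one is a true origin gives F1 = 2/6 ≈ 0.333, which shows how over-broad predictions are penalised; whether that explains the Wave 7 figure in the abstract is not stated in the source.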

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. eLife | 5422 papers in training set | Top 3% | 14.0%
2. Nature Communications | 4913 papers in training set | Top 15% | 12.1%
3. Scientific Reports | 3102 papers in training set | Top 7% | 9.9%
4. PLOS ONE | 4510 papers in training set | Top 32% | 4.7%
5. Journal of Medical Internet Research | 85 papers in training set | Top 1.0% | 4.7%
6. npj Digital Medicine | 97 papers in training set | Top 1% | 4.1%
7. Genome Medicine | 154 papers in training set | Top 2% | 3.8%
50% of probability mass above
8. The Lancet Infectious Diseases | 71 papers in training set | Top 1.0% | 2.8%
9. Eurosurveillance | 80 papers in training set | Top 0.4% | 2.5%
10. Molecular Systems Biology | 142 papers in training set | Top 0.4% | 2.3%
11. Epidemiology and Infection | 84 papers in training set | Top 1% | 1.7%
12. Journal of The Royal Society Interface | 189 papers in training set | Top 3% | 1.7%
13. PLOS Biology | 408 papers in training set | Top 10% | 1.7%
14. Patterns | 70 papers in training set | Top 1% | 1.6%
15. BMC Medicine | 163 papers in training set | Top 4% | 1.6%
16. BMC Infectious Diseases | 118 papers in training set | Top 3% | 1.3%
17. The Lancet Regional Health - Western Pacific | 15 papers in training set | Top 0.1% | 1.3%
18. Journal of Infection | 71 papers in training set | Top 2% | 1.2%
19. International Journal of Infectious Diseases | 126 papers in training set | Top 2% | 1.2%
20. PLOS Computational Biology | 1633 papers in training set | Top 21% | 1.1%
21. JMIR Public Health and Surveillance | 45 papers in training set | Top 3% | 0.9%
22. Nature Medicine | 117 papers in training set | Top 4% | 0.9%
23. Communications Biology | 886 papers in training set | Top 17% | 0.9%
24. eBioMedicine | 130 papers in training set | Top 3% | 0.9%
25. Frontiers in Public Health | 140 papers in training set | Top 8% | 0.8%
26. The Lancet Digital Health | 25 papers in training set | Top 1% | 0.8%
27. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 44% | 0.8%
28. Emerging Infectious Diseases | 103 papers in training set | Top 3% | 0.7%
29. Heliyon | 146 papers in training set | Top 7% | 0.7%
30. Cell | 370 papers in training set | Top 18% | 0.7%