
The SARS-CoV-2 Integrated Genomic Epidemiology Database (IGED): Linking viral genomes with patient-level metadata to advance statewide genomic surveillance in California

Ryder, R.; Elder, J.; Panditrao, M.; Grosgebauer, K.; Katz, R.; Tello, L.; Carroll, E.; Borthwick, D.; Kaur, C.; Smith, R.; Shiau, V.; Wheeler, W.; Reilly, E.; Myers, J.; Nelson, L.; Lim, E.; Arunleung, P.; Baylis, E.; Gilliam, S.; Hennesy-Burt, T.; Bregman, B.; Silver, E.; Kapsak, C.; Wright, S.; Leon, T.; Bell, J.; Morales, C.; Wadford, D. A.

2026-05-19 health informatics
10.64898/2026.05.14.26353263 medRxiv

In July 2021, Title 17 of the California Code of Regulations required all laboratories performing SARS-CoV-2 whole genome sequencing (WGS) to report their sequencing results to the California Department of Public Health (CDPH). These viral genomic data and patient metadata were compiled into the Integrated Genomic Epidemiology Database (IGED). Linking anonymized viral sequences with patient-level information enabled monitoring of infectiousness, pathogenicity, transmission dynamics, evolution, and vaccine evasion among emerging SARS-CoV-2 lineages. Laboratories performing SARS-CoV-2 WGS transmitted sequencing results to CDPH through Electronic Laboratory Reporting (ELR) and non-ELR pathways. CDPH applied uniform reporting requirements but allowed flexibility in specific data formats to accommodate diverse data systems. To preserve data quality and interoperability across heterogeneous sources, CDPH implemented standardization, validation, and deduplication protocols. Snowflake, a cloud-based data storage and analytics platform, and Posit Connect, a cloud deployment and automation platform, supported the management, processing, and integration of data within the IGED. The IGED established links between SARS-CoV-2 WGS data and epidemiologic metadata for 801,418 sequences, representing 81.7% of all sequences reported in California. Lineages reported to the IGED showed strong concordance with lineage proportions in GISAID. Sequences reported to the IGED had average turnaround times longer than one month, and the majority of sequencing was performed in Southern California and Los Angeles. The IGED enhanced genomic surveillance through predictive modeling and through monitoring of concerning evolutionary trends, such as recombination and saltations in persistent infections.
Development of the IGED highlighted the need for standardized data requirements, sustained funding for sequencing, incentives for data submission, and interdisciplinary collaboration to build an effective genomic surveillance system. This framework for linking genomic and epidemiologic data has not only generated critical insights for SARS-CoV-2 but also provided the foundation for CDPH and other public health organizations to develop similar IGED-like systems for other priority pathogens as genomic surveillance expands.
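The abstract does not describe how CDPH implemented its standardization, deduplication, and linkage steps, so the following is only a minimal illustrative sketch of that general pattern. The record fields (`accession`, `lineage`, `lab`), the function names, and the toy data are all hypothetical and are not taken from the IGED.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceReport:
    # All fields are hypothetical placeholders, not the IGED schema.
    accession: str
    lineage: str
    lab: str


def standardize(report):
    """Normalize whitespace and case so equivalent reports compare equal."""
    return SequenceReport(
        accession=report.accession.strip().upper(),
        lineage=report.lineage.strip().upper(),
        lab=report.lab.strip(),
    )


def deduplicate(reports):
    """Keep the first standardized report seen for each accession."""
    seen = {}
    for report in map(standardize, reports):
        seen.setdefault(report.accession, report)
    return list(seen.values())


def link(reports, metadata_by_accession):
    """Pair each report with its metadata record and return the linkage rate."""
    linked = [
        (report, metadata_by_accession[report.accession])
        for report in reports
        if report.accession in metadata_by_accession
    ]
    rate = len(linked) / len(reports) if reports else 0.0
    return linked, rate


# Toy data: two submissions of the same specimen plus one unlinked sequence.
reports = [
    SequenceReport(" ca-001 ", "ba.2", "Lab A"),
    SequenceReport("CA-001", "BA.2", "Lab B"),
    SequenceReport("CA-002", "XBB.1.5", "Lab A"),
]
metadata = {"CA-001": {"county": "Alameda"}}

unique_reports = deduplicate(reports)
linked, linkage_rate = link(unique_reports, metadata)
print(len(unique_reports), len(linked), linkage_rate)
```

In this toy run the duplicate submission collapses into one record, and one of the two remaining sequences links to metadata. A production system would likely match on more than a single identifier.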

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1 | Journal of Clinical Microbiology | 120 papers in training set | Top 0.4% | 6.4%
2 | Nature Communications | 4913 papers in training set | Top 29% | 6.4%
3 | Med | 38 papers in training set | Top 0.1% | 6.3%
4 | JMIR Public Health and Surveillance | 45 papers in training set | Top 0.2% | 6.3%
5 | Viruses | 318 papers in training set | Top 1% | 4.9%
6 | Frontiers in Microbiology | 375 papers in training set | Top 2% | 4.0%
7 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.7% | 4.0%
8 | PLOS ONE | 4510 papers in training set | Top 35% | 4.0%
9 | Scientific Reports | 3102 papers in training set | Top 31% | 4.0%
10 | JAMIA Open | 37 papers in training set | Top 0.4% | 3.9%
50% of probability mass above
11 | Annals of Internal Medicine | 27 papers in training set | Top 0.1% | 3.6%
12 | Scientific Data | 174 papers in training set | Top 0.5% | 3.6%
13 | Genome Medicine | 154 papers in training set | Top 3% | 2.6%
14 | npj Digital Medicine | 97 papers in training set | Top 2% | 2.1%
15 | BMJ Health & Care Informatics | 13 papers in training set | Top 0.4% | 1.7%
16 | Patterns | 70 papers in training set | Top 1% | 1.5%
17 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.5% | 1.5%
18 | mBio | 750 papers in training set | Top 9% | 1.2%
19 | JAMA | 17 papers in training set | Top 0.2% | 1.1%
20 | GigaScience | 172 papers in training set | Top 2% | 1.0%
21 | Science Advances | 1098 papers in training set | Top 25% | 1.0%
22 | Cell | 370 papers in training set | Top 15% | 1.0%
23 | Science Translational Medicine | 111 papers in training set | Top 5% | 1.0%
24 | Frontiers in Public Health | 140 papers in training set | Top 7% | 0.9%
25 | Journal of Medical Internet Research | 85 papers in training set | Top 4% | 0.9%
26 | Cell Reports Medicine | 140 papers in training set | Top 7% | 0.9%
27 | eLife | 5422 papers in training set | Top 58% | 0.7%
28 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 4% | 0.7%
29 | Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 6% | 0.7%
30 | The Lancet Digital Health | 25 papers in training set | Top 1% | 0.7%
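
The statement that the top 10 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The sketch below is only an arithmetic illustration; the function name `ranks_to_reach` is hypothetical, and the inputs are the rounded values for the first twelve journals, so the total is approximate.

```python
def ranks_to_reach(probabilities, threshold):
    """Return (rank, cumulative) where the running sum first reaches threshold."""
    cumulative = 0.0
    for rank, probability in enumerate(probabilities, start=1):
        cumulative += probability
        if cumulative >= threshold:
            return rank, cumulative
    # Threshold never reached: report the total instead.
    return None, cumulative


# Rounded percentages for the twelve highest-ranked journals in the list.
top_probabilities = [6.4, 6.4, 6.3, 6.3, 4.9, 4.0, 4.0, 4.0, 4.0, 3.9, 3.6, 3.6]

rank, cumulative = ranks_to_reach(top_probabilities, 50.0)
print(rank, round(cumulative, 1))  # prints: 10 50.2
```

The running sum is about 46.3% after nine journals and crosses 50% at the tenth, which matches the position of the divider in the list.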