The SARS-CoV-2 Integrated Genomic Epidemiology Database (IGED): Linking viral genomes with patient-level metadata to advance statewide genomic surveillance in California
Ryder, R.; Elder, J.; Panditrao, M.; Grosgebauer, K.; Katz, R.; Tello, L.; Carroll, E.; Borthwick, D.; Kaur, C.; Smith, R.; Shiau, V.; Wheeler, W.; Reilly, E.; Myers, J.; Nelson, L.; Lim, E.; Arunleung, P.; Baylis, E.; Gilliam, S.; Hennesy-Burt, T.; Bregman, B.; Silver, E.; Kapsak, C.; Wright, S.; Leon, T.; Bell, J.; Morales, C.; Wadford, D. A.
In July 2021, the California Code of Regulations Title 17 required all laboratories performing SARS-CoV-2 whole genome sequencing (WGS) to report their sequencing results to the California Department of Public Health (CDPH). These viral genomic data and patient metadata were compiled into the Integrated Genomic Epidemiology Database (IGED). Linking anonymized viral sequences with patient-level information enabled monitoring of infectiousness, pathogenicity, transmission dynamics, evolution, and vaccine evasion among emerging SARS-CoV-2 lineages. Laboratories performing SARS-CoV-2 WGS transmitted sequencing results to CDPH through Electronic Laboratory Reporting (ELR) and non-ELR pathways. CDPH applied uniform reporting requirements but allowed flexibility in specific data formats to accommodate diverse data systems. To preserve data quality and interoperability across heterogeneous sources, CDPH implemented standardization, validation, and deduplication protocols. Snowflake, a cloud-based data storage and analytics platform, and Posit Connect, a cloud deployment and automation platform, supported the management, processing, and integration of data within the IGED. The IGED established links between SARS-CoV-2 WGS data and epidemiologic metadata for 801,418 sequences, representing 81.7% of all sequences reported in California. Lineages reported to the IGED showed strong concordance with lineage proportions in GISAID. Sequences reported to the IGED had average turnaround times longer than one month, and the majority of sequencing was performed in Southern California and Los Angeles. The IGED enhanced genomic surveillance through predictive modeling and through monitoring of concerning evolutionary trends, such as recombination and saltations in persistent infections.
Development of the IGED highlighted the need for standardized data requirements, sustained funding for sequencing, incentives for data submission, and interdisciplinary collaboration to build an effective genomic surveillance system. This framework for linking genomic and epidemiologic data has not only generated critical insights for SARS-CoV-2 but also provided the foundation for CDPH and other public health organizations to develop similar IGED-like systems for other priority pathogens as genomic surveillance expands.