Integrating 730,947 exome sequences with clinical literature improves gene discovery
Guez, J.; Goodrich, J. K.; Moldovan, M. A.; Chao, K. R.; Kar, P.; Panchal, R.; Wilson, M. W.; Laricchia, K. M.; Rohlicek, G.; Biba, D.; Marten, D.; He, Q.; Darnowsky, P. W.; Grant, R.; Weisburd, B.; Baxter, S. M.; Nadeau, J.; Lu, W.; Jahl, S.; Parsa, S.; Lamane, A.; DiTroia, S.; Fu, J.; Zhao, X.; Alarmani, E.; Tolonen, C.; Novod, S.; Bryant, S.; Stevens, C.; Chapman, S. B.; Cusick, C.; Vittal, C.; Gauthier, L. D.; Goldstein, J. I.; Goldstein, D.; King, D.; gnomAD Project Consortium; Tranchero, M.; Lotter, W.; MacArthur, D. G.; Brand, H.; Seplyarskiy, V.; Koch, E.; Talkowski, M. E.; Solomons
Accurate estimates of allele frequencies aid in genetic discovery, including rare disease diagnosis, common disease investigations, and population genetics. Here, we present the Genome Aggregation Database version 4 (gnomAD v4), comprising 807,162 sequenced individuals including 730,947 exomes, a fivefold increase over previous releases, and 76,215 genomes. We demonstrate that statistical power to detect strong selective constraint continues to increase with sample size. We develop a new loss-of-function annotation pipeline, which learns genomic features predictive of nonsense-mediated decay and splicing effects from selection signals, achieving 90% precision for distinguishing likely true versus false positive loss-of-function variants. This improved pipeline, along with incorporation of highly deleterious missense variants into measures of loss-of-function intolerance, improves disease gene detection particularly for short genes and those with gain-of-function mechanisms. To improve disease gene prediction, we systematically extract gene-disease associations from biomedical literature, map these to gene-level biological features, and integrate both with refined constraint metrics within a Bayesian framework, yielding state-of-the-art prediction of gene-disease relevance. Building on this integration, we define a Discovery Potential (DisPo) score that highlights genes under strong constraint but limited clinical characterization. High-DisPo genes are enriched in embryonic lethal and fertility phenotypes, supporting DisPo as a tool to prioritize previously under-characterized disease genes. Together, these advances establish a unified framework for accelerating gene discovery and improving rare disease diagnosis.
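The abstract's claim that power to detect strong constraint grows with sample size can be illustrated with a toy calculation. The sketch below is not the gnomAD v4 pipeline. It assumes loss-of-function counts in a gene follow a Poisson distribution and, as a deliberate simplification, that the expected count grows in proportion to cohort size. The function names (`poisson_cdf`, `oe_upper_bound`) and the expected counts are hypothetical. For a gene with zero observed variants, it computes a one-sided upper confidence bound on the observed/expected ratio: the tighter the bound, the more confidently the gene can be called constrained.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term.

    Adequate for the small counts used here; exp(-lam) underflows for very large lam.
    """
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def oe_upper_bound(observed, expected, alpha=0.05):
    """One-sided upper bound on the observed/expected ratio.

    Finds the ratio r at which P(X <= observed | mean = r * expected) equals alpha,
    using bisection (the CDF decreases monotonically in r).
    """
    lo, hi = 0.0, 1.0
    # Widen the bracket until the upper end is rejected at level alpha.
    while poisson_cdf(observed, hi * expected) > alpha:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if poisson_cdf(observed, mid * expected) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical short gene: 1 expected LoF variant in a smaller cohort,
# 5 expected after a fivefold increase (assumed linear scaling).
# No LoF variants are observed in either case.
bounds = {expected: oe_upper_bound(0, expected) for expected in (1.0, 5.0)}
```

With zero observed variants, the bound is about 3.0 when one variant is expected and about 0.6 when five are expected. Under these assumptions, the smaller cohort cannot distinguish the gene from an unconstrained one, while the larger cohort can, which is consistent with the abstract's point about short genes.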