
Integrating 730,947 exome sequences with clinical literature improves gene discovery

Guez, J.; Goodrich, J. K.; Moldovan, M. A.; Chao, K. R.; Kar, P.; Panchal, R.; Wilson, M. W.; Laricchia, K. M.; Rohlicek, G.; Biba, D.; Marten, D.; He, Q.; Darnowsky, P. W.; Grant, R.; Weisburd, B.; Baxter, S. M.; Nadeau, J.; Lu, W.; Jahl, S.; Parsa, S.; Lamane, A.; DiTroia, S.; Fu, J.; Zhao, X.; Alarmani, E.; Tolonen, C.; Novod, S.; Bryant, S.; Stevens, C.; Chapman, S. B.; Cusick, C.; Vittal, C.; Gauthier, L. D.; Goldstein, J. I.; Goldstein, D.; King, D.; gnomAD Project Consortium; Tranchero, M.; Lotter, W.; MacArthur, D. G.; Brand, H.; Seplyarskiy, V.; Koch, E.; Talkowski, M. E.; Solomons

2026-03-25 genetic and genomic medicine
10.64898/2026.03.23.26349081 medRxiv

Accurate estimates of allele frequencies aid in genetic discovery, including rare disease diagnosis, common disease investigations, and population genetics. Here, we present the Genome Aggregation Database version 4 (gnomAD v4), comprising 807,162 sequenced individuals including 730,947 exomes, a fivefold increase over previous releases, and 76,215 genomes. We demonstrate that statistical power to detect strong selective constraint continues to increase with sample size. We develop a new loss-of-function annotation pipeline, which learns genomic features predictive of nonsense-mediated decay and splicing effects from selection signals, achieving 90% precision for distinguishing likely true versus false positive loss-of-function variants. This improved pipeline, along with incorporation of highly deleterious missense variants into measures of loss-of-function intolerance, improves disease gene detection particularly for short genes and those with gain-of-function mechanisms. To improve disease gene prediction, we systematically extract gene-disease associations from biomedical literature, map these to gene-level biological features, and integrate both with refined constraint metrics within a Bayesian framework, yielding state-of-the-art prediction of gene-disease relevance. Building on this integration, we define a Discovery Potential (DisPo) score that highlights genes under strong constraint but limited clinical characterization. High-DisPo genes are enriched in embryonic lethal and fertility phenotypes, supporting DisPo as a tool to prioritize previously under-characterized disease genes. Together, these advances establish a unified framework for accelerating gene discovery and improving rare disease diagnosis.
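The abstract does not spell out the Bayesian framework, so the following is only a generic sketch of how two evidence sources can be combined with Bayes' rule in odds form, not the authors' model. The function name `posterior_probability`, the prior, and the likelihood ratios are all hypothetical, and the sketch assumes the evidence sources are conditionally independent.

```python
import math


def posterior_probability(prior: float, likelihood_ratios: list[float]) -> float:
    """Update a prior probability with independent likelihood ratios.

    Hypothetical illustration: `prior` stands in for a baseline probability
    that a gene is disease-relevant, and each likelihood ratio stands in for
    one evidence source (e.g. a constraint metric or a literature signal).
    None of these values come from the paper.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")

    # Work in log-odds so that independent evidence adds up.
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        if lr <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(lr)

    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Made-up numbers: a 10% prior, with two evidence sources that each
    # favor disease relevance by a factor of 3.
    print(posterior_probability(0.10, [3.0, 3.0]))
```

Working in log-odds keeps the update additive and numerically stable. A likelihood ratio of 1 leaves the prior unchanged, which is a convenient default when one evidence source is missing for a gene.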

Matching journals

The top 4 journals account for 50% of the predicted probability mass.