EA-PheWAS: Integrating Phenotype Embeddings with PheWAS for Enhanced Gene-Phenotype Discovery

Zheng, W.; Liu, T.; Xu, L.; Xie, Y.; Jing, Y.; Shao, H.; Zhao, H.

2026-04-22 genetics
10.64898/2026.04.21.720031 bioRxiv
Phenome-wide association studies (PheWAS) enable systematic exploration of relationships between genetic variants and clinical phenotypes derived from electronic health records (EHRs). Conventional regression-based PheWAS treats phenotypes separately and relies on binary phenotype representations, which limits statistical power for rare variants and rare phenotypes and reduces the ability to detect associations with phenotypes that are distributed across clinical codes. To address these limitations, we first developed EmbedPheScan, a phenotype embedding-based prioritization framework that summarizes the phenotypic profiles of rare loss-of-function (LoF) variant carriers in a continuous embedding space. We then proposed EA-PheWAS, which combines these embedding-derived signals with conventional regression-based PheWAS results using the aggregated Cauchy association test. Using UK Biobank whole-exome sequencing and EHR data, we show that the proposed methods maintain appropriate false-positive control. We then performed genome-wide phenome scans across all genes and across biologically defined gene classes to evaluate EA-PheWAS relative to conventional PheWAS and EmbedPheScan, and found that EA-PheWAS consistently outperformed the other two methods. We illustrate the utility of EA-PheWAS by focusing on four genes representing distinct scenarios, including strong-effect disease genes (PKD1, PKD2), a gene with large numbers of rare LoF carriers (NF1), and a gene with extremely sparse carrier counts (FBN1).
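
The abstract names the aggregated Cauchy association test as the step that merges the embedding-derived signal with the regression-based PheWAS result. The sketch below shows the generic form of that test for combining p-values. It is not taken from the paper: the function name `acat_combine`, the equal default weights, and the numerical cutoffs are assumptions for illustration, and the abstract does not say how EA-PheWAS weights its two inputs.

```python
import math


def acat_combine(p_values, weights=None):
    """Combine p-values with the generic aggregated Cauchy association test.

    Hypothetical helper for illustration only; equal weights are assumed
    when none are given.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Normalize the weights so they sum to 1.
    total = sum(weights)
    weights = [w / total for w in weights]

    # Each p-value is mapped to a standard Cauchy variate and averaged.
    statistic = 0.0
    for p, w in zip(p_values, weights):
        if not 0.0 < p < 1.0:
            raise ValueError("p-values must lie strictly between 0 and 1")
        if p < 1e-16:
            # For tiny p, tan((0.5 - p) * pi) is approximately 1 / (p * pi).
            statistic += w / (p * math.pi)
        else:
            statistic += w * math.tan((0.5 - p) * math.pi)

    # Upper tail of the standard Cauchy distribution at the statistic.
    if statistic > 1e15:
        # The tail is approximately 1 / (statistic * pi) for very large values.
        return 1.0 / (statistic * math.pi)
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    # Hypothetical inputs: one p-value per method for a single gene-phenotype pair.
    p_regression = 0.04
    p_embedding = 1e-6
    print(acat_combine([p_regression, p_embedding]))
```

With equal weights, a single very small p-value dominates the combined result, which is why this kind of combination can retain a signal found by only one of the two methods.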

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. The American Journal of Human Genetics: 206 papers in training set, Top 0.1%, 38.5%
2. Bioinformatics: 1061 papers in training set, Top 4%, 6.2%
3. Nature Genetics: 240 papers in training set, Top 1%, 6.2%
50% of probability mass above
4. Nature Communications: 4913 papers in training set, Top 31%, 6.2%
5. Genome Medicine: 154 papers in training set, Top 2%, 4.2%
6. Genetic Epidemiology: 46 papers in training set, Top 0.2%, 3.9%
7. Genome Research: 409 papers in training set, Top 1%, 3.0%
8. Science Translational Medicine: 111 papers in training set, Top 2%, 2.3%
9. European Journal of Human Genetics: 49 papers in training set, Top 0.5%, 1.8%
10. Cell Genomics: 162 papers in training set, Top 3%, 1.8%
11. Briefings in Bioinformatics: 326 papers in training set, Top 4%, 1.6%
12. PLOS Genetics: 756 papers in training set, Top 10%, 1.5%
13. PLOS Computational Biology: 1633 papers in training set, Top 20%, 1.2%
14. Human Genetics and Genomics Advances: 70 papers in training set, Top 0.5%, 1.2%
15. International Journal of Epidemiology: 74 papers in training set, Top 2%, 1.2%
16. Genome Biology: 555 papers in training set, Top 6%, 1.2%
17. Scientific Reports: 3102 papers in training set, Top 67%, 1.2%
18. Human Molecular Genetics: 130 papers in training set, Top 2%, 1.1%
19. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 40%, 0.9%
20. Frontiers in Genetics: 197 papers in training set, Top 8%, 0.9%
21. Bioinformatics Advances: 184 papers in training set, Top 4%, 0.9%
22. Nucleic Acids Research: 1128 papers in training set, Top 16%, 0.9%
23. Nature Computational Science: 50 papers in training set, Top 2%, 0.8%
24. PLOS ONE: 4510 papers in training set, Top 69%, 0.7%
25. eLife: 5422 papers in training set, Top 60%, 0.7%