Overcome the Limitation of Phenome-Wide Association Studies (PheWAS): Extension of PheWAS to Efficient and Robust Large-Scale ICD Codes Analysis

Lin, Y.; Zhang, S.; Vessels, T. J.; Bastarache, L.; Bejan, C. A.; Hsi, R. S.; Phillips, E. J.; Ruderfer, D. M.; Pulley, J.; Edwards, T.; Wells, Q. S.; Warner, J. L.; Denny, J. C.; Roden, D. M.; Kang, H.; Xu, Y.

2024-04-19 health informatics
10.1101/2024.04.15.24305098 medRxiv
Phenome-wide association studies (PheWAS) have become widely used for efficient, high-throughput evaluation of the relationship between a genetic factor and a large number of disease phenotypes, typically extracted from a DNA biobank linked with electronic medical records (EMR). Phecodes, which encode disease case-control status derived from billing codes, are usually used as outcome variables in PheWAS, and logistic regression has been the standard analysis method. Because clinical diagnoses in EMR are often inaccurate, and these errors can bias odds ratio estimates, much effort has gone into accurately defining cases and controls. Specifically, to correctly classify controls in the population, a list of exclusion criteria was manually compiled for each Phecode so that unbiased odds ratios could be obtained. However, the accuracy of the list cannot be guaranteed without an extensive data curation process. This costly curation limits the efficiency of large-scale analyses that take full advantage of all structured phenotypic information available in EMR. Here, we propose estimating relative risks (RR) instead. We first demonstrated, via a theoretical formula, the desirable property of RR that overcomes inaccuracy in the controls. With simulation and a real data application, we further confirmed that RR is unbiased without compiling exclusion criteria lists. With RR as the estimate, we are able to efficiently extend PheWAS to a larger-scale, phenome-construction-agnostic analysis of phenotypes using ICD 9/10 codes, which preserve much more disease-related clinical information than Phecodes.
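
The abstract contrasts odds ratios with relative risks. As a minimal sketch of how the two quantities differ, the standard textbook formulas can be applied to a 2x2 table of exposure (for example, carrying a genetic variant) against case status. This is not the authors' estimator or their theoretical formula; the function names and all counts below are hypothetical and chosen only for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk: risk among the exposed divided by risk among the unexposed."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Hypothetical counts, not taken from the paper.
counts = dict(a=30, b=970, c=10, d=990)

print("OR:", odds_ratio(**counts))
print("RR:", relative_risk(**counts))
```

The odds ratio is computed from the case and non-case counts directly, whereas the relative risk divides each case count by its whole exposure group, so the two estimates respond differently to how the non-case group is defined.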

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 3%; 10.1%
2. Scientific Reports: 3102 papers in training set; Top 8%; 9.1%
3. Journal of Biomedical Informatics: 45 papers in training set; Top 0.2%; 8.4%
4. Journal of Personalized Medicine: 28 papers in training set; Top 0.1%; 8.4%
5. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set; Top 0.2%; 6.3%
6. PLOS Genetics: 756 papers in training set; Top 4%; 4.2%
7. PLOS ONE: 4510 papers in training set; Top 36%; 4.0%

50% of probability mass above

8. JAMIA Open: 37 papers in training set; Top 0.4%; 3.6%
9. Nature Communications: 4913 papers in training set; Top 40%; 3.6%
10. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 3.6%
11. PLOS Computational Biology: 1633 papers in training set; Top 12%; 2.7%
12. BMC Medical Genomics: 36 papers in training set; Top 0.3%; 2.4%
13. Communications Biology: 886 papers in training set; Top 4%; 2.4%
14. Journal of the American Medical Informatics Association: 61 papers in training set; Top 1.0%; 2.4%
15. BMC Bioinformatics: 383 papers in training set; Top 4%; 2.1%
16. Nature Computational Science: 50 papers in training set; Top 0.9%; 1.3%
17. Genomics, Proteomics & Bioinformatics: 171 papers in training set; Top 4%; 1.2%
18. Patterns: 70 papers in training set; Top 2%; 1.1%
19. NAR Genomics and Bioinformatics: 214 papers in training set; Top 3%; 0.9%
20. iScience: 1063 papers in training set; Top 27%; 0.9%
21. BMC Medical Research Methodology: 43 papers in training set; Top 1%; 0.8%
22. eLife: 5422 papers in training set; Top 58%; 0.7%
23. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set; Top 0.7%; 0.7%
24. GENETICS: 189 papers in training set; Top 2%; 0.7%
25. JMIR Public Health and Surveillance: 45 papers in training set; Top 4%; 0.7%
26. Science Advances: 1098 papers in training set; Top 33%; 0.6%