
Phenome-Wide Association Study of Pre-Cancer Diagnosis Electronic Health Records Identifies Risk and Inverse Associations in the All of Us Research Program

Rich, C. C. D.; Bang, E. J.; Bair, A. B.; Richardson, B. E.; Millington, J. L.; Bates, B. A.; Davis, M. F.; Bailey, M. H.

2026-05-28 health informatics
10.64898/2026.05.26.26353823 medRxiv

Background: The All of Us Research Program represents a rich resource for cancer epidemiology research, with over 400,000 participants whose whole genome sequences are linked to electronic health records (EHR). Large cancer datasets often focus exclusively on cases without controls and neglect pre-diagnosis healthcare occurrences. Here, we perform a phenome-wide association study (PheWAS) of EHR data recorded at least 1 year before diagnosis, comparing cancer cases with matched controls to reveal co-occurring and mutually exclusive phenotypes.

Methods: We identified 55,000+ cancer cases across 21 cancer types in All of Us version 8. To eliminate age-related confounding, we implemented a two-stage matching and censoring strategy: loose matching on demographics to establish index dates and cohort comparability, followed by right-censoring of EHR data (excluding the 1 year before diagnosis/index), then 1:2 matching to address residual demographic imbalance. We tested associations between case status (23,193 cancer cases and 46,386 matched controls) and approximately 1,600 clinical phenotypes using logistic regression adjusted for sex at birth, self-reported race, age at diagnosis/index date, and two censored EHR metrics (observation window and unique condition count), with Bonferroni correction for multiple testing.

Results: Our analysis identified 232 significantly associated phenotypes, confirming established cancer risk factors including elevated prostate specific antigen (OR = 2.92, 95% CI: 2.65-3.23; p = 1.8 × 10⁻¹⁰¹) and multinodular goiter (OR = 1.73, 95% CI: 1.56-1.91; p = 6.7 × 10⁻²⁷). Further investigation of several phenotypes with seemingly inverse effects is warranted.

Conclusions: This PheWAS of EHR data from at least 1 year before diagnosis leveraged the diversity of All of Us to examine how clinical phenotypes prior to cancer diagnosis vary across cancer types and racial groups. Our findings validate All of Us as a robust platform for cancer epidemiology research, confirming established risk factors at scale across diverse populations. This work provides methodological insights for EHR-based susceptibility analyses and demonstrates the value of agnostic phenome-wide approaches for generating hypotheses in precision medicine.
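
As a rough illustration of the statistics named in the abstract, the sketch below computes a Bonferroni-adjusted significance threshold and converts a logistic regression coefficient into an odds ratio with a 95% confidence interval. It is not the authors' code. The family-wise alpha of 0.05 is an assumption (the abstract does not state it), the function names are hypothetical, and the standard error of 0.0505 is an illustrative value back-calculated from the reported PSA interval rather than a figure from the paper.

```python
import math

# Normal quantile for a two-sided 95% confidence interval.
Z_95 = 1.959964


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def odds_ratio_ci(beta: float, se: float, z: float = Z_95) -> tuple[float, float, float]:
    """Turn a log-odds coefficient and its standard error into (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    # Assumed alpha = 0.05; roughly 1,600 phenotypes were tested per the abstract.
    threshold = bonferroni_threshold(0.05, 1600)
    print(f"Bonferroni threshold: {threshold:.3e}")

    # Illustrative inputs: beta = ln(2.92), se = 0.0505 (assumed, not reported).
    or_, lower, upper = odds_ratio_ci(math.log(2.92), 0.0505)
    print(f"OR = {or_:.2f}, 95% CI: {lower:.2f}-{upper:.2f}")
```

Under these assumptions, both p-values reported in the Results fall far below the per-test threshold, which is what a Bonferroni-significant association requires.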

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

Rank. Journal | Papers in training set | Top % | Predicted probability

1. Nature Communications | 4913 | Top 22% | 8.5%
2. Scientific Reports | 3102 | Top 9% | 8.5%
3. JAMIA Open | 37 | Top 0.1% | 6.9%
4. Cancer Medicine | 24 | Top 0.2% | 4.9%
5. Journal of Biomedical Informatics | 45 | Top 0.3% | 4.9%
6. Journal of the American Medical Informatics Association | 61 | Top 0.7% | 4.0%
7. PLOS ONE | 4510 | Top 38% | 3.6%
8. The Lancet Digital Health | 25 | Top 0.1% | 3.6%
9. Annals of Internal Medicine | 27 | Top 0.2% | 3.1%
10. JCO Clinical Cancer Informatics | 18 | Top 0.3% | 2.8%

50% of probability mass above

11. eBioMedicine | 130 | Top 0.5% | 2.6%
12. npj Digital Medicine | 97 | Top 2% | 2.5%
13. The American Journal of Human Genetics | 206 | Top 2% | 1.7%
14. Journal of Personalized Medicine | 28 | Top 0.3% | 1.7%
15. Patterns | 70 | Top 0.8% | 1.7%
16. International Journal of Cancer | 42 | Top 0.7% | 1.5%
17. Cell Reports Medicine | 140 | Top 5% | 1.2%
18. Science Advances | 1098 | Top 23% | 1.2%
19. JMIR Public Health and Surveillance | 45 | Top 3% | 1.1%
20. Bioinformatics | 1061 | Top 8% | 1.1%
21. BMJ Health & Care Informatics | 13 | Top 0.7% | 1.0%
22. Genome Medicine | 154 | Top 6% | 1.0%
23. JAMA Network Open | 127 | Top 3% | 1.0%
24. Communications Medicine | 85 | Top 0.6% | 1.0%
25. eLife | 5422 | Top 53% | 0.9%
26. BMC Medicine | 163 | Top 6% | 0.9%
27. JNCI: Journal of the National Cancer Institute | 16 | Top 0.6% | 0.8%
28. JNCI Cancer Spectrum | 10 | Top 0.5% | 0.8%
29. GENETICS | 189 | Top 1% | 0.7%
30. JAMA | 17 | Top 0.4% | 0.7%
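
The statement that the top 10 journals account for 50% of the predicted probability mass can be checked with a running total. The sketch below is only an arithmetic check on the percentages listed above, not the site's ranking code; the function name and the choice to pass only the first twelve values are illustrative assumptions.

```python
from itertools import accumulate
from typing import Optional, Sequence


def journals_to_reach(probabilities: Sequence[float], target: float) -> Optional[int]:
    """Return the smallest rank whose cumulative probability meets the target.

    Probabilities are expected in ranked (descending) order, as percentages.
    Returns None if the target is never reached.
    """
    for rank, cumulative in enumerate(accumulate(probabilities), start=1):
        if cumulative >= target:
            return rank
    return None


# Predicted probabilities (percent) for ranks 1 through 12, copied from the list.
TOP_PROBABILITIES = [8.5, 8.5, 6.9, 4.9, 4.9, 4.0, 3.6, 3.6, 3.1, 2.8, 2.6, 2.5]

if __name__ == "__main__":
    print(journals_to_reach(TOP_PROBABILITIES, 50.0))
```

With the listed values, the running total is about 48.0% after nine journals and about 50.8% after ten, consistent with the 50% marker placed after rank 10.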