Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

Salvatore, M.; Kundu, R.; Du, J.; Friese, C. R.; Mondul, A. M.; Hanauer, D. A.; Lu, H.; Pearce, C. L.; Mukherjee, B.

2024-10-29 epidemiology
10.1101/2024.10.28.24316286 medRxiv
Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missingness in EHRs is complex: values may go unrecorded, records may be fragmented, or the absence of a measurement may itself be clinically informative. This study explores whether polygenic risk score (PRS)-informed multiple imputation of missing traits, combined with sample weighting, can mitigate missing data and selection biases when estimating disease-exposure associations. Simulations were conducted under missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions and under different sampling mechanisms. PRS-informed multiple imputation generally showed lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome data MAR, covariate-adjusted, weighted analyses using PRS-informed imputation had lower percent bias (3.8%) and a higher coverage rate (0.883) than those using PRS-uninformed imputation (4.5%; 0.877) or complete case analysis (10.3%; 0.784). In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.
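
The abstract reports percent bias and coverage rate for multiply imputed estimates. As a rough illustration of how such quantities are commonly computed, here is a minimal Python sketch. It is not the authors' code. The function names (`pool_rubin`, `percent_bias`, `coverage_rate`) are hypothetical. The percent bias definition (absolute relative bias of the mean estimate) and the normal-approximation 95% interval are assumptions, since the abstract does not state which definitions the study used.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m imputation-specific estimates with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def percent_bias(estimates, truth):
    """Assumed definition: 100 * |mean(estimates) - truth| / |truth|."""
    return 100 * abs(mean(estimates) - truth) / abs(truth)


def coverage_rate(estimates, std_errors, truth, z=1.96):
    """Share of simulation replicates whose interval est +/- z*se contains truth.

    Uses a normal approximation; Rubin's rules normally use a t reference
    distribution, which is omitted here for brevity.
    """
    hits = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return hits / len(estimates)


# Toy numbers, purely illustrative (not taken from the study).
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
pooled_se = math.sqrt(total_var)
```

In a simulation study, `pool_rubin` would be applied once per replicate, and `percent_bias` and `coverage_rate` would then summarize the pooled estimates across replicates against the known true parameter.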

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. American Journal of Epidemiology | 57 papers in training set | Top 0.1% | 37.9%
2. BMC Medical Research Methodology | 43 papers in training set | Top 0.1% | 10.1%
3. International Journal of Epidemiology | 74 papers in training set | Top 0.1% | 10.1%
(50% of probability mass above)
4. Epidemiology | 26 papers in training set | Top 0.1% | 3.6%
5. Journal of Biomedical Informatics | 45 papers in training set | Top 0.4% | 3.6%
6. PLOS ONE | 4510 papers in training set | Top 45% | 2.6%
7. The American Journal of Human Genetics | 206 papers in training set | Top 2% | 2.6%
8. European Journal of Epidemiology | 40 papers in training set | Top 0.3% | 1.9%
9. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.9%
10. Nature Communications | 4913 papers in training set | Top 53% | 1.5%
11. Genetic Epidemiology | 46 papers in training set | Top 0.5% | 1.5%
12. Statistics in Medicine | 34 papers in training set | Top 0.2% | 1.3%
13. eLife | 5422 papers in training set | Top 49% | 1.2%
14. Scientific Reports | 3102 papers in training set | Top 69% | 1.0%
15. PLOS Genetics | 756 papers in training set | Top 13% | 0.9%
16. PLOS Computational Biology | 1633 papers in training set | Top 22% | 0.9%
17. npj Digital Medicine | 97 papers in training set | Top 3% | 0.8%
18. Nature Human Behaviour | 85 papers in training set | Top 4% | 0.8%
19. Epidemiology and Infection | 84 papers in training set | Top 3% | 0.8%
20. Genome Medicine | 154 papers in training set | Top 8% | 0.7%
21. Bioinformatics | 1061 papers in training set | Top 10% | 0.7%
22. Biometrics | 22 papers in training set | Top 0.2% | 0.7%
23. JAMIA Open | 37 papers in training set | Top 2% | 0.7%
24. Patterns | 70 papers in training set | Top 3% | 0.7%
25. GENETICS | 189 papers in training set | Top 2% | 0.6%
26. Pharmacoepidemiology and Drug Safety | 13 papers in training set | Top 0.5% | 0.6%
27. Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.6%