Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets

Park, D. K.; Chen, M.; Kim, S.; Joo, Y. Y.; Loving, R.; Kim, H.-S.; Cha, J.; Yoo, S.; Kim, J. H.

2022-01-22 genomics
10.1101/2022.01.19.476997 bioRxiv
Recently, the polygenic risk score (PRS) has gained significant attention in studies of complex genetic diseases and traits. PRS is often derived from summary statistics, for which the independence between discovery and replication sets cannot be monitored. Prior studies, in which independence is strictly observed, report a relatively low gain from PRS in predictive models of binary traits. We hypothesize that the independence assumption may be compromised when using summary statistics, and suspect an overestimation bias in predictive accuracy. To demonstrate the overestimation bias in the replication dataset, prediction performances of PRS models are compared when overlapping subjects are either present or removed. We consider the task of Alzheimer's disease (AD) prediction across genetic datasets, including the International Genomics of Alzheimer's Project (IGAP), AD Sequencing Project (ADSP), and Accelerating Medicine Partnership - Alzheimer's Disease (AMP-AD). PRS is computed from either sequencing studies for ADSP and AMP-AD (denoted as rPRS) or the summary statistics for IGAP (sPRS). Two traits with high heritability in UK Biobank, hypertension and height, are used to derive an exemplary scale effect of PRS. Based on the scale effect, the expected performance of sPRS is computed for AD prediction. Using ADSP as a discovery set for rPRS on AMP-AD, ΔAUC and ΔR² (performance gains in AUC and R² by PRS) are 0.069 and 0.11, respectively. These drop to 0.0017 and 0.0041 once overlapping subjects are removed from AMP-AD. sPRS is derived from IGAP and records ΔAUC and ΔR² of 0.051±0.013 and 0.063±0.015 for ADSP and 0.060 and 0.086 for AMP-AD, respectively. On UK Biobank, rPRS performances for hypertension, assuming a similar size of discovery and replication sets, are 0.0036±0.0027 (ΔAUC) and 0.0032±0.0028 (ΔR²). For height, ΔR² is 0.029±0.0037.
Considering the high heritability of hypertension and height in UK Biobank, we conclude that sPRS results from AD databases are inflated. Higher performances relative to the size of the discovery set have been observed in PRS studies of several diseases. PRS performances for binary traits, such as AD and hypertension, turned out unexpectedly low. This may, along with differences in linkage disequilibrium, explain the high variability of PRS performances in cross-nation or cross-ethnicity applications, i.e., when there are no overlapping subjects. Hence, for sPRS, potential duplications should be carefully considered within the same ethnic group.
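
The comparison described in the abstract (PRS gain in AUC with overlapping subjects present versus removed) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, subject IDs, labels, and scores below are all hypothetical toy values, the baseline scores stand in for a covariate-only model, and AUC is computed with a simple rank-based estimator rather than any specific library.

```python
from typing import Iterable, List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: chance that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def delta_auc(baseline: Sequence[float], with_prs: Sequence[float],
              labels: Sequence[int]) -> float:
    """Gain in AUC from adding PRS to a baseline predictor."""
    return auc(with_prs, labels) - auc(baseline, labels)


def non_overlapping(replication_ids: Sequence[str],
                    discovery_ids: Iterable[str]) -> List[int]:
    """Indices of replication subjects that do not appear in the discovery set."""
    discovery = set(discovery_ids)
    return [i for i, sid in enumerate(replication_ids) if sid not in discovery]


# Hypothetical toy replication set: subjects a-d also appear in the discovery set.
ids = ["a", "b", "c", "d", "e", "f", "g", "h"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
baseline = [0.5] * 8                                   # uninformative covariate-only score
with_prs = [0.9, 0.1, 0.9, 0.1, 0.6, 0.4, 0.3, 0.5]    # PRS fits overlapping subjects well
discovery_ids = ["a", "b", "c", "d"]

gain_all = delta_auc(baseline, with_prs, labels)

keep = non_overlapping(ids, discovery_ids)
gain_clean = delta_auc([baseline[i] for i in keep],
                       [with_prs[i] for i in keep],
                       [labels[i] for i in keep])

print(gain_all, gain_clean)  # 0.375 0.0
```

In this toy setup the apparent gain comes entirely from subjects shared with the discovery set, which is the kind of inflation the abstract attributes to overlapping subjects.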

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Genetic Epidemiology: 46 papers in training set; Top 0.1%; predicted probability 28.4%
2. Scientific Reports: 3102 papers in training set; Top 5%; predicted probability 10.4%
3. Frontiers in Genetics: 197 papers in training set; Top 0.6%; predicted probability 7.0%
4. PLOS ONE: 4510 papers in training set; Top 26%; predicted probability 6.5%

50% of probability mass above

5. European Journal of Human Genetics: 49 papers in training set; Top 0.2%; predicted probability 4.4%
6. PLOS Computational Biology: 1633 papers in training set; Top 8%; predicted probability 4.1%
7. BMC Medical Genomics: 36 papers in training set; Top 0.1%; predicted probability 3.7%
8. Journal of Personalized Medicine: 28 papers in training set; Top 0.1%; predicted probability 3.0%
9. International Journal of Epidemiology: 74 papers in training set; Top 0.8%; predicted probability 2.8%
10. BioData Mining: 15 papers in training set; Top 0.2%; predicted probability 1.9%
11. F1000Research: 79 papers in training set; Top 2%; predicted probability 1.7%
12. Journal of Medical Genetics: 28 papers in training set; Top 0.3%; predicted probability 1.7%
13. BMC Genomics: 328 papers in training set; Top 3%; predicted probability 1.4%
14. PLOS Genetics: 756 papers in training set; Top 11%; predicted probability 1.3%
15. BMC Bioinformatics: 383 papers in training set; Top 5%; predicted probability 1.3%
16. Human Genetics: 25 papers in training set; Top 0.2%; predicted probability 1.3%
17. Genes: 126 papers in training set; Top 2%; predicted probability 1.1%
18. Bioinformatics: 1061 papers in training set; Top 9%; predicted probability 0.9%
19. npj Genomic Medicine: 33 papers in training set; Top 0.8%; predicted probability 0.8%
20. Human Molecular Genetics: 130 papers in training set; Top 3%; predicted probability 0.8%
21. Journal of Alzheimer’s Disease: 39 papers in training set; Top 1%; predicted probability 0.7%
22. Brain Communications: 147 papers in training set; Top 3%; predicted probability 0.7%
23. Briefings in Bioinformatics: 326 papers in training set; Top 7%; predicted probability 0.7%
24. The American Journal of Human Genetics: 206 papers in training set; Top 4%; predicted probability 0.7%
25. GENETICS: 189 papers in training set; Top 2%; predicted probability 0.5%
26. Database: 51 papers in training set; Top 1%; predicted probability 0.5%
27. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 12%; predicted probability 0.5%
28. European Journal of Epidemiology: 40 papers in training set; Top 1.0%; predicted probability 0.5%
29. Nucleic Acids Research: 1128 papers in training set; Top 21%; predicted probability 0.5%