Benchmarking Polygenic Risk Score Model Assumptions: towards more accurate risk assessment

Kulm, S.; Mezey, J.; Elemento, O.

2022-02-18 genomics
10.1101/2022.02.18.480983 bioRxiv

Polygenic risk scores represent an individual's genetic susceptibility to a phenotype. Like any model, the statistical models commonly employed to fit polygenic risk scores and assess their accuracy rest on several assumptions. The effects of these assumptions on polygenic risk score models have not been thoroughly assessed. We assessed 26 variations of the traditional polygenic risk score model, each of which mitigates assumptions in one of five facets of disease modelling: representation of age (6 variations), censorship (3 variations), competing risks (7 variations), formation of disease labels (6 variations), and selection of covariates (4 variations). With data from the UK Biobank, each model variation included age, sex, and a polygenic risk score derived from the PGS Catalog. Each of the 26 model variations was fitted to predict 18 diseases. Compared to the plain model, which contained the assumptions of all five facets, the model variations often fit the data better and generated predictions that differed substantially from those of the plain model. Royston's R² measured a model's goodness of fit, and thereby determined whether the model was an enhancement upon the plain model. For 15 of the 26 model variations, Royston's R² was greater than that of the plain model for more than 50% of diseases. The reclassification rate, defined as the fraction of individuals in the top five percentiles of the plain model's predictions who are not in the top five percentiles of a model variation's predictions, was used to determine whether the variation led to significantly different predictions. For 20 of the 26 model variations, the median reclassification rate calculated across the 18 diseases was greater than 10%. Comparisons of accuracy statistics further illustrated how much each model variation's predictions differed from the plain model's predictions. Models containing polygenic risk scores appear to be significantly affected by many common modelling assumptions. Therefore, future investigations should consider taking some action to mitigate modelling assumptions.

Author Summary

An individual's genetics can increase their risk of experiencing a disease. The exact magnitude of the increased risk is estimated within a statistical model. The traditional model type employed in this process is relatively plain and contains several assumptions. The predicted risk estimates from this plain model may be unnecessarily inaccurate. To test this possibility, we searched the literature for model variations that reduce the assumptions of the plain model, ultimately creating 26 distinct model variations that may improve upon the plain model. Each model variation was fit with data from the UK Biobank to predict 18 diseases. We found that 15 of the 26 model variations fit the data better than the plain model for a majority of diseases. Goodness of fit was measured with Royston's R² statistic. Further calculations found that the predictions of the model variations were often significantly more or less accurate than the predictions of the plain model. We believe these results indicate that future investigations of polygenic risk scores should not employ the plain model, as unreliable risk predictions will likely result.
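
The reclassification rate defined in the abstract can be illustrated with a short sketch. This is not the authors' code; the function name `reclassification_rate`, its parameters, and the use of a quantile threshold to define "the top five percentiles" are assumptions made for illustration, and the two inputs are assumed to be predicted risks for the same individuals in the same order.

```python
import numpy as np


def reclassification_rate(plain_pred, variation_pred, top_fraction=0.05):
    """Fraction of individuals in the top `top_fraction` of the plain model's
    predictions who are NOT in the top `top_fraction` of the variation's
    predictions. Names and thresholding are illustrative assumptions."""
    plain_pred = np.asarray(plain_pred, dtype=float)
    variation_pred = np.asarray(variation_pred, dtype=float)
    if plain_pred.shape != variation_pred.shape:
        raise ValueError("predictions must cover the same individuals")

    # Flag individuals at or above each model's own 95th-percentile risk.
    plain_top = plain_pred >= np.quantile(plain_pred, 1 - top_fraction)
    variation_top = variation_pred >= np.quantile(variation_pred, 1 - top_fraction)

    # Among the plain model's high-risk group, count those the variation drops.
    return float(np.mean(~variation_top[plain_top]))
```

Under this reading, a rate of 0 means both models flag the same high-risk group, and a rate above 0.10 corresponds to the 10% threshold mentioned in the abstract.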

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Journal of Medical Genetics: 28 papers in training set, Top 0.1%, 18.5%
2. European Journal of Human Genetics: 49 papers in training set, Top 0.1%, 8.4%
3. PLOS ONE: 4510 papers in training set, Top 28%, 6.4%
4. Genetic Epidemiology: 46 papers in training set, Top 0.1%, 6.4%
5. Frontiers in Genetics: 197 papers in training set, Top 1%, 4.8%
6. BioData Mining: 15 papers in training set, Top 0.1%, 4.8%
7. Scientific Reports: 3102 papers in training set, Top 37%, 3.6%

50% of probability mass above

8. GENETICS: 189 papers in training set, Top 0.3%, 3.6%
9. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.8%, 3.6%
10. Genetics in Medicine: 69 papers in training set, Top 0.4%, 3.6%
11. BMC Medical Genomics: 36 papers in training set, Top 0.2%, 3.1%
12. PLOS Computational Biology: 1633 papers in training set, Top 11%, 2.9%
13. Human Mutation: 29 papers in training set, Top 0.3%, 2.1%
14. F1000Research: 79 papers in training set, Top 2%, 1.7%
15. PLOS Genetics: 756 papers in training set, Top 10%, 1.5%
16. PeerJ: 261 papers in training set, Top 8%, 1.5%
17. Genes: 126 papers in training set, Top 1%, 1.3%
18. The American Journal of Human Genetics: 206 papers in training set, Top 3%, 1.2%
19. Journal of Personalized Medicine: 28 papers in training set, Top 0.6%, 1.2%
20. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 7%, 1.1%
21. G3 Genes|Genomes|Genetics: 351 papers in training set, Top 2%, 1.1%
22. Bioinformatics: 1061 papers in training set, Top 9%, 0.9%
23. Human Molecular Genetics: 130 papers in training set, Top 3%, 0.9%
24. Wellcome Open Research: 57 papers in training set, Top 2%, 0.8%
25. International Journal of Epidemiology: 74 papers in training set, Top 3%, 0.7%
26. BMC Genomics: 328 papers in training set, Top 6%, 0.7%
27. Nucleic Acids Research: 1128 papers in training set, Top 20%, 0.6%
28. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.7%, 0.6%