Benchmarking Polygenic Risk Score Model Assumptions: towards more accurate risk assessment

Kulm, S.; Mezey, J.; Elemento, O.

2022-02-18 genomics

10.1101/2022.02.18.480983 bioRxiv

Polygenic risk scores represent an individual's genetic susceptibility to a phenotype. As with any model, the statistical models commonly employed to fit polygenic risk scores and assess their accuracy contain several assumptions. The effects of these assumptions on polygenic risk score models have not been thoroughly assessed. We assessed 26 variations of the traditional polygenic risk score model, each of which mitigates assumptions in one of five facets of disease modelling: representation of age (6 variations), censorship (3 variations), competing risks (7 variations), formation of disease labels (6 variations), and selection of covariates (4 variations). With data from the UK Biobank, each model variation included age, sex, and a polygenic risk score derived from the PGS Catalog. Each of the 26 model variations was fitted to predict 18 diseases. Compared to the plain model, which contained all five facets of assumptions, the model variations often fit the data better and generated predictions that differed largely from those of the plain model. Royston's R2 statistic measured each model's goodness of fit, and thereby determined whether a model variation improved upon the plain model. For 15 of the 26 model variations, Royston's R2 was greater than that of the plain model for >50% of diseases. The reclassification rate, defined as the fraction of individuals in the top five percentiles of the plain model's predictions who are not in the top five percentiles of a model variation's predictions, was used to determine whether the variation led to significantly different predictions. For 20 of the 26 model variations, the median reclassification rate calculated across the 18 diseases was greater than 10%. Comparisons of accuracy statistics further illustrated how much each model variation's predictions differed from the plain model's predictions. Models containing polygenic risk scores appear to be significantly affected by many common modelling assumptions. Therefore, future investigations should consider taking some action to mitigate modelling assumptions.

Author Summary

An individual's genetics can increase their risk of experiencing a disease. The exact magnitude of the increased risk is estimated within a statistical model. The traditional model type employed in this process is relatively plain and contains several assumptions. The predicted risk estimates from this plain model may be unnecessarily inaccurate. To test this possibility, we searched the literature for model variations that reduce the assumptions of the plain model, ultimately creating 26 distinct model variations that may improve upon the plain model. Each model variation was fit with data from the UK Biobank to predict 18 diseases. We found that 15 of the 26 model variations fit the data better than the plain model for a majority of diseases. Goodness of fit was measured with Royston's R2 statistic. Further calculations found that the predictions of the model variations were often significantly more or less accurate than the predictions of the plain model. We believe these results indicate that future investigations of polygenic risk scores should not employ the plain model, as unreliable risk predictions will likely result.
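
The reclassification rate defined in the abstract can be illustrated with a short Python sketch. This is not the authors' code: the function name `reclassification_rate`, its parameters, and the choice to define the top group with a quantile cutoff (ties at the cutoff are kept in the top group) are assumptions made for illustration only.

```python
import numpy as np


def reclassification_rate(plain_pred, variation_pred, top_fraction=0.05):
    """Fraction of individuals in the top group of the plain model's
    predictions who are NOT in the top group of the variation's predictions.

    Hypothetical helper: the inputs are per-individual predicted risks in the
    same order, and the top group is everyone at or above the
    (1 - top_fraction) quantile of each model's own predictions.
    """
    plain_pred = np.asarray(plain_pred, dtype=float)
    variation_pred = np.asarray(variation_pred, dtype=float)
    if plain_pred.shape != variation_pred.shape:
        raise ValueError("both prediction arrays must cover the same individuals")

    # Each model gets its own cutoff, since the two risk scales may differ.
    plain_cutoff = np.quantile(plain_pred, 1.0 - top_fraction)
    variation_cutoff = np.quantile(variation_pred, 1.0 - top_fraction)

    in_plain_top = plain_pred >= plain_cutoff
    in_variation_top = variation_pred >= variation_cutoff

    # Among the plain model's top group, count those the variation drops.
    return float(np.mean(~in_variation_top[in_plain_top]))
```

Under these assumptions, identical predictions give a rate of 0, and a variation that ranks individuals in exactly the opposite order gives a rate of 1. The abstract's 10% threshold would correspond to a returned value above 0.10.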
