
Comparing the XGBoost machine learning algorithm to polygenic scoring for the prediction of intelligence based on genotype data

Fahey, L.; Morris, D. W.; O Broin, P.

2022-06-15 bioinformatics
10.1101/2022.06.12.495467 bioRxiv

A polygenic score (PGS) is a linear combination of effects from a GWAS that represents, and can be used to predict, genetic predisposition to a particular phenotype. A key limitation of the PGS method is that it assumes additive and independent SNP effects, although epistasis (gene interactions) is known to contribute to complex traits. Machine learning methods can potentially overcome this limitation by virtue of their ability to capture nonlinear interactions in high-dimensional data. Intelligence is a complex trait for which PGS prediction currently explains up to 5.2% of the variance, a relatively small proportion of the heritability estimate of 50% obtained from twin studies. Here, we use gradient boosting, a machine learning technique based on an ensemble of weak prediction models, to predict intelligence from genotype data. We found that while gradient boosting did not outperform the PGS method in predicting intelligence from SNP data, it achieved similar predictive performance with fewer than a quarter of the SNPs, and the top SNPs it identified as important for predictive performance were biologically meaningful. These results indicate that ML methods may be useful in interpreting the biological meaning underpinning SNP-phenotype associations, because the ML model requires fewer SNPs than the standard GWAS-based PGS method.
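To make the "linear combination of effects" concrete, the following is a minimal sketch of an additive polygenic score, assuming each SNP is coded as an effect-allele dosage (0, 1, or 2) and weighted by a per-SNP GWAS effect size. The function name `polygenic_score` and all numeric values are hypothetical and chosen only for illustration; they are not taken from the preprint.

```python
def polygenic_score(dosages, effect_sizes):
    """Additive PGS: sum over SNPs of (effect size * allele dosage)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(beta * dose for beta, dose in zip(effect_sizes, dosages))


# Hypothetical GWAS effect sizes for three SNPs (illustrative values only).
effect_sizes = [0.5, -0.25, 1.0]
# Hypothetical genotype: copies of the effect allele (0, 1, or 2) at each SNP.
dosages = [2, 1, 0]

score = polygenic_score(dosages, effect_sizes)
print(score)
```

Because each SNP contributes independently of every other SNP, a score of this form cannot represent an interaction between two loci, which is the limitation the abstract attributes to the PGS method.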

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Scientific Reports: 3102 papers in training set, Top 2%, 14.5%
2. BMC Bioinformatics: 383 papers in training set, Top 0.5%, 14.5%
3. Frontiers in Genetics: 197 papers in training set, Top 0.2%, 12.8%
4. BioData Mining: 15 papers in training set, Top 0.1%, 10.2%
50% of probability mass above
5. Bioinformatics: 1061 papers in training set, Top 4%, 4.9%
6. PLOS ONE: 4510 papers in training set, Top 31%, 4.9%
7. PLOS Genetics: 756 papers in training set, Top 4%, 4.0%
8. Genetic Epidemiology: 46 papers in training set, Top 0.2%, 3.7%
9. Bioinformatics Advances: 184 papers in training set, Top 2%, 3.1%
10. PLOS Computational Biology: 1633 papers in training set, Top 13%, 2.4%
11. PeerJ: 261 papers in training set, Top 6%, 1.9%
12. European Journal of Human Genetics: 49 papers in training set, Top 0.8%, 1.2%
13. NeuroImage: 813 papers in training set, Top 5%, 1.2%
14. Human Molecular Genetics: 130 papers in training set, Top 2%, 1.2%
15. Communications Biology: 886 papers in training set, Top 17%, 1.0%
16. Frontiers in Neuroscience: 223 papers in training set, Top 6%, 0.9%
17. Statistics in Medicine: 34 papers in training set, Top 0.3%, 0.8%
18. BMC Genomics: 328 papers in training set, Top 6%, 0.8%
19. G3 Genes|Genomes|Genetics: 351 papers in training set, Top 3%, 0.7%
20. Frontiers in Human Neuroscience: 67 papers in training set, Top 3%, 0.7%
21. Briefings in Bioinformatics: 326 papers in training set, Top 7%, 0.6%
22. Biostatistics: 21 papers in training set, Top 0.2%, 0.6%
23. Genes: 126 papers in training set, Top 4%, 0.6%
24. Behavior Genetics: 15 papers in training set, Top 0.2%, 0.5%
25. Human Genetics and Genomics Advances: 70 papers in training set, Top 1%, 0.5%
26. Human Brain Mapping: 295 papers in training set, Top 5%, 0.5%
27. International Journal of Obesity: 25 papers in training set, Top 0.7%, 0.5%
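
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. This is a minimal sketch, assuming the listed percentages are the predicted probabilities in rank order; the helper name `journals_to_reach` is hypothetical and is not part of the page.

```python
def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative total) once the running sum reaches threshold."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total  # threshold never reached


# Predicted probabilities (in percent) of the six highest-ranked journals above.
top_probs = [14.5, 14.5, 12.8, 10.2, 4.9, 4.9]

count, total = journals_to_reach(top_probs, 50.0)
print(count, round(total, 1))
```

With these values the first three journals sum to 41.8% and the fourth brings the total to 52.0%, so four journals are needed to pass the 50% mark.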