Can machine learning aid in identifying disease genes? The case of autism spectrum disorder

Gunning, M.; Pavlidis, P.

2020-11-27 bioinformatics
10.1101/2020.11.26.394676 bioRxiv
Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: Can machine learning aid in the discovery of disease genes? We collected thirteen published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, such as the number of association publications. We found that ML methods that do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a level comparable to generic measures of the likelihood that a gene is involved in any condition, and do not outperform genetic association studies. Future efforts to discover disease genes should focus on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using heterogeneous biological data of unknown reliability.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics | 383 papers in training set | Top 0.2% | 22.7%
2. BioData Mining | 15 papers in training set | Top 0.1% | 10.2%
3. PLOS Computational Biology | 1633 papers in training set | Top 5% | 6.4%
4. Scientific Reports | 3102 papers in training set | Top 17% | 6.4%
5. Frontiers in Genetics | 197 papers in training set | Top 0.8% | 6.4%

50% of probability mass above

6. PLOS ONE | 4510 papers in training set | Top 31% | 4.9%
7. Bioinformatics Advances | 184 papers in training set | Top 1% | 4.0%
8. npj Genomic Medicine | 33 papers in training set | Top 0.1% | 3.6%
9. Bioinformatics | 1061 papers in training set | Top 6% | 2.6%
10. Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.7%
11. Genetic Epidemiology | 46 papers in training set | Top 0.4% | 1.7%
12. PeerJ | 261 papers in training set | Top 7% | 1.7%
13. Frontiers in Neuroscience | 223 papers in training set | Top 4% | 1.5%
14. Frontiers in Molecular Biosciences | 100 papers in training set | Top 3% | 1.0%
15. Disease Models & Mechanisms | 119 papers in training set | Top 2% | 1.0%
16. F1000Research | 79 papers in training set | Top 3% | 1.0%
17. European Journal of Human Genetics | 49 papers in training set | Top 1.0% | 1.0%
18. Journal of Personalized Medicine | 28 papers in training set | Top 0.9% | 0.9%
19. BMC Medical Genomics | 36 papers in training set | Top 1% | 0.8%
20. Communications Biology | 886 papers in training set | Top 21% | 0.8%
21. IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.5% | 0.8%
22. Human Genetics and Genomics Advances | 70 papers in training set | Top 0.8% | 0.8%
23. Autism Research | 32 papers in training set | Top 0.4% | 0.8%
24. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 2% | 0.7%
25. Neural Computation | 36 papers in training set | Top 0.8% | 0.7%
26. Genome Biology | 555 papers in training set | Top 8% | 0.7%
27. Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 0.6%
28. Physiological Measurement | 12 papers in training set | Top 0.5% | 0.6%
29. GigaScience | 172 papers in training set | Top 4% | 0.6%
30. NAR Genomics and Bioinformatics | 214 papers in training set | Top 5% | 0.5%