Evaluating Genetic-Based Disease Prediction Approaches Through Simulation

Shpak, M.; Parfitt, E.; Mahmoudiandehkordi, S.; Maadooliat, M.; Schrodi, S. J.

2025-03-26 genetic and genomic medicine
10.1101/2025.03.21.25324431 medRxiv

Common diseases exhibit substantial heritability, and GWAS of these diseases have revealed hundreds of thousands of high-frequency disease susceptibility variants throughout the genome. These studies offer the prospect of using genomic data to improve disease prediction and diagnosis; however, the relative performance of different predictive modeling approaches is not well characterized. To investigate this systematically, we constructed a Monte Carlo simulation that generates model genomes with large numbers of SNPs, a proportion of which carry risk alleles parameterized by the strength of their effects and by their mode of inheritance (additive, dominant, recessive, and combinations thereof). After generating genotypes for cases and controls, we applied several machine learning classifiers (logistic regression, naive Bayes, random forests, and neural networks, with and without feature selection) to predict disease phenotype from genotypes. Each classifier's rates of false positives and false negatives were evaluated and compared using AUC. We found that random forest models were the most accurate predictors of disease phenotype over the range of inheritance parameters, followed by logistic regression and naive Bayes, while the feedforward multilayer neural network-based predictive model had lower AUC. Furthermore, with the small fraction of null sites in our model, there was almost no difference in the performance of classifiers with or without LASSO-based feature selection. We also investigated the association of AUC with the difference in polygenic risk score (PRS) between disease and control samples by comparing AUC in the simulations to the values predicted from the PRS distributions based on odds-risk and liability models.
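The simulation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`encode`, `simulate_cohort`, `empirical_auc`, `normal_theory_auc`), the logistic risk model, the parameter values, and the sampling of a single cohort rather than separate case and control sets are all assumptions made for illustration. The comparison value uses the standard binormal approximation AUC = Φ(Δ / √(σ²_cases + σ²_controls)), where Δ is the difference in mean PRS; it stands in for, and may differ from, the odds-risk and liability models used in the paper.

```python
import math
import random
import statistics


def encode(genotype, mode):
    """Map a risk-allele count (0, 1, 2) to a risk unit under an inheritance mode."""
    if mode == "additive":
        return float(genotype)
    if mode == "dominant":
        return 1.0 if genotype >= 1 else 0.0
    if mode == "recessive":
        return 1.0 if genotype == 2 else 0.0
    raise ValueError(f"unknown mode: {mode}")


def simulate_cohort(n_people, n_snps, freq, beta, mode, intercept, rng):
    """Draw genotypes and phenotypes; return (PRS values, 0/1 disease labels).

    Assumptions (hypothetical, not from the paper): every SNP is a risk SNP
    with the same allele frequency and effect size, genotypes follow
    Hardy-Weinberg proportions, and disease risk is logistic in the PRS.
    """
    scores, labels = [], []
    for _ in range(n_people):
        prs = 0.0
        for _ in range(n_snps):
            # Two independent allele draws give a count of 0, 1, or 2.
            genotype = (rng.random() < freq) + (rng.random() < freq)
            prs += beta * encode(genotype, mode)
        risk = 1.0 / (1.0 + math.exp(-(intercept + prs)))
        labels.append(1 if rng.random() < risk else 0)
        scores.append(prs)
    return scores, labels


def empirical_auc(scores, labels):
    """AUC from the Mann-Whitney rank statistic, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    case_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (case_rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)


def normal_theory_auc(scores, labels):
    """Binormal approximation: Phi(mean difference / sqrt(sum of variances))."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    delta = statistics.mean(cases) - statistics.mean(controls)
    spread = math.sqrt(statistics.variance(cases) + statistics.variance(controls))
    return 0.5 * (1.0 + math.erf(delta / spread / math.sqrt(2.0)))


if __name__ == "__main__":
    rng = random.Random(1)
    for mode in ("additive", "dominant", "recessive"):
        scores, labels = simulate_cohort(2000, 50, 0.3, 0.2, mode, -3.0, rng)
        print(mode, round(empirical_auc(scores, labels), 3),
              round(normal_theory_auc(scores, labels), 3))
```

Here the PRS itself is used as the predictor, so the sketch only shows how AUC relates to the separation of PRS distributions between affected and unaffected individuals. Comparing classifiers as in the paper would require fitting each model to the genotype matrix and scoring held-out samples.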

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Frontiers in Genetics (197 papers in training set; Top 0.1%; 33.0%)
2. Genetic Epidemiology (46 papers in training set; Top 0.1%; 8.4%)
3. Human Genetics (25 papers in training set; Top 0.1%; 8.4%)
4. Scientific Reports (3102 papers in training set; Top 18%; 6.4%)

50% of probability mass above

5. Human Molecular Genetics (130 papers in training set; Top 0.5%; 4.3%)
6. PLOS Genetics (756 papers in training set; Top 4%; 3.6%)
7. Human Genetics and Genomics Advances (70 papers in training set; Top 0.1%; 3.6%)
8. BMC Medical Genomics (36 papers in training set; Top 0.1%; 3.6%)
9. PLOS ONE (4510 papers in training set; Top 39%; 3.6%)
10. PLOS Computational Biology (1633 papers in training set; Top 14%; 2.1%)
11. Briefings in Bioinformatics (326 papers in training set; Top 4%; 1.8%)
12. npj Genomic Medicine (33 papers in training set; Top 0.4%; 1.7%)
13. Human Genomics (21 papers in training set; Top 0.1%; 1.5%)
14. Genome Medicine (154 papers in training set; Top 5%; 1.3%)
15. European Journal of Human Genetics (49 papers in training set; Top 1.0%; 1.0%)
16. Bioinformatics (1061 papers in training set; Top 9%; 0.9%)
17. Human Mutation (29 papers in training set; Top 0.6%; 0.9%)
18. Frontiers in Molecular Biosciences (100 papers in training set; Top 4%; 0.9%)
19. The American Journal of Human Genetics (206 papers in training set; Top 3%; 0.8%)
20. Journal of Personalized Medicine (28 papers in training set; Top 1%; 0.8%)
21. International Journal of Molecular Sciences (453 papers in training set; Top 14%; 0.8%)
22. BMC Genomics (328 papers in training set; Top 6%; 0.7%)
23. Frontiers in Neuroscience (223 papers in training set; Top 7%; 0.7%)
24. Genomics (60 papers in training set; Top 3%; 0.7%)