GP-ML-DC: An Ensemble Machine Learning-Based Genomic Prediction Approach with Automated Two-Phase Dimensionality Reduction via Divide-and-Conquer Techniques

Liu, Q.; Ma, H.; Zhang, Z.; Hu, Z.; Wang, X.; Li, R.; Cai, Y.; Jiang, Y.

2024-12-26 bioinformatics
10.1101/2024.12.26.630443 bioRxiv

Traditional machine learning (ML) and deep learning (DL) methods for genomic prediction are often hampered by the imbalance between the limited number of samples (n) and the large number of single nucleotide polymorphisms (SNPs) (p), with n much smaller than p. To address this, we propose GP-ML-DC, a genomic predictor that combines traditional ML and DL models with a two-phase, parameter-free dimensionality reduction technique. In the first phase, GP-ML-DC reduces feature dimensionality by characterizing genes as features. In the second phase, building on big data methodologies, it employs a divide-and-conquer approach to segment gene regions into multiple haplotypes, further decreasing dimensionality. Each haplotype segment is processed by a sub-task based on traditional ML, and a neural network then integrates the results of all sub-tasks. Our experiments, conducted on four cattle milk-related traits using ten-fold cross-validation and independent testing, show that GP-ML-DC significantly surpasses current state-of-the-art genomic predictors in prediction performance.
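The divide-and-conquer idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the abstract does not specify the sub-task learner, the segmentation rule, or the network architecture. The sketch assumes ridge regression as the per-segment learner, equal-sized contiguous column blocks as a stand-in for haplotype segments, and a linear stacking layer in place of the integrating neural network. All function names and the synthetic genotype data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def predict_ridge(model, X):
    w, b = model
    return X @ w + b

def split_blocks(n_features, block_size):
    """Divide step: contiguous column ranges (assumed stand-in for haplotype segments)."""
    return [slice(s, min(s + block_size, n_features))
            for s in range(0, n_features, block_size)]

def block_predictions(blocks, sub_models, X):
    """One column of predictions per block."""
    return np.column_stack([predict_ridge(m, X[:, b])
                            for m, b in zip(sub_models, blocks)])

def fit_divide_and_conquer(X, y, block_size=50, lam=1.0):
    blocks = split_blocks(X.shape[1], block_size)
    # Conquer step: one small model per block, each with far fewer features than p.
    sub_models = [fit_ridge(X[:, b], y, lam) for b in blocks]
    # Integration step: a linear combiner over the block predictions
    # (assumption: replaces the neural network described in the abstract).
    # A careful version would fit the combiner on out-of-fold predictions.
    Z = block_predictions(blocks, sub_models, X)
    combiner = fit_ridge(Z, y, lam)
    return blocks, sub_models, combiner

def predict_divide_and_conquer(model, X):
    blocks, sub_models, combiner = model
    return predict_ridge(combiner, block_predictions(blocks, sub_models, X))

# Synthetic example with n much smaller than p (hypothetical data).
rng = np.random.default_rng(0)
n, p = 80, 400
X = rng.integers(0, 3, size=(n, p)).astype(float)  # genotypes coded 0/1/2
effects = np.zeros(p)
effects[rng.choice(p, size=10, replace=False)] = rng.normal(size=10)
y = X @ effects + rng.normal(scale=0.5, size=n)

model = fit_divide_and_conquer(X, y, block_size=50)
preds = predict_divide_and_conquer(model, X)
```

With 400 SNP columns and a block size of 50, each sub-model sees only 50 features and the combiner sees 8, which is the sense in which splitting the problem reduces the dimensionality each learner has to handle.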

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Bioinformatics Advances: 184 papers in training set; Top 0.2%; 8.3%
2. Frontiers in Genetics: 197 papers in training set; Top 0.5%; 8.1%
3. BMC Bioinformatics: 383 papers in training set; Top 1%; 7.1%
4. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set; Top 0.1%; 6.7%
5. Bioinformatics: 1061 papers in training set; Top 4%; 6.7%
6. Briefings in Bioinformatics: 326 papers in training set; Top 0.9%; 6.3%
7. Genome Research: 409 papers in training set; Top 0.4%; 6.3%
8. BMC Genomics: 328 papers in training set; Top 0.4%; 4.8%

50% of probability mass above

9. PLOS ONE: 4510 papers in training set; Top 36%; 3.9%
10. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set; Top 0.5%; 3.6%
11. PLOS Computational Biology: 1633 papers in training set; Top 11%; 3.0%
12. Advanced Science: 249 papers in training set; Top 8%; 2.4%
13. Scientific Reports: 3102 papers in training set; Top 47%; 2.4%
14. Communications Biology: 886 papers in training set; Top 4%; 2.3%
15. Nature Communications: 4913 papers in training set; Top 47%; 2.1%
16. NAR Genomics and Bioinformatics: 214 papers in training set; Top 2%; 1.9%
17. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set; Top 0.2%; 1.8%
18. iScience: 1063 papers in training set; Top 15%; 1.7%
19. Genomics, Proteomics & Bioinformatics: 171 papers in training set; Top 4%; 1.7%
20. Nucleic Acids Research: 1128 papers in training set; Top 14%; 1.2%
21. Genome Medicine: 154 papers in training set; Top 7%; 0.9%
22. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 8%; 0.9%
23. PLOS Genetics: 756 papers in training set; Top 13%; 0.9%
24. Nature Machine Intelligence: 61 papers in training set; Top 3%; 0.8%
25. Journal of Computational Biology: 37 papers in training set; Top 0.5%; 0.8%
26. Genome Biology: 555 papers in training set; Top 7%; 0.8%
27. GigaScience: 172 papers in training set; Top 3%; 0.7%
28. Cell Systems: 167 papers in training set; Top 13%; 0.7%
29. Frontiers in Immunology: 586 papers in training set; Top 9%; 0.6%
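
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below assumes the final percentage in each entry is that journal's predicted probability; the function name is hypothetical, and only the first ten values from the list are used.

```python
# Predicted probabilities (%) for the ten highest-ranked journals in the list.
probs = [8.3, 8.1, 7.1, 6.7, 6.7, 6.3, 6.3, 4.8, 3.9, 3.6]

def journals_to_reach(probs, threshold=50.0):
    """Return (k, mass): the smallest number of top-ranked entries whose
    cumulative probability reaches the threshold, and that cumulative mass."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return k, total
    return None, total  # threshold not reached with the values given

k, mass = journals_to_reach(probs)
print(k, round(mass, 1))
```

The first seven entries sum to 49.5%, just under the threshold, and adding the eighth brings the total to 54.3%, which is why the cutoff falls after rank 8.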