Identifying genes associated with phenotypes using machine and deep learning

Muneeb, M.; Ascher, D.

2026-03-07 bioinformatics
10.64898/2026.03.05.709665 bioRxiv

Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
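The evaluation quantities named in the abstract can be illustrated with a small sketch. The F1 score and MCC below follow their standard confusion-matrix definitions. The abstract does not define the gene identification ratio, so `gene_identification_ratio` is a hypothetical reading (the share of catalog genes that also appear among the model-selected genes) and may differ from the authors' definition. All function names and the toy inputs are illustrative assumptions, not part of the paper's pipeline.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = case and 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def gene_identification_ratio(identified_genes, catalog_genes):
    """ASSUMPTION: fraction of catalog genes recovered among the identified genes.

    The source does not give the GIR formula; this is one plausible reading.
    """
    catalog = set(catalog_genes)
    if not catalog:
        return 0.0
    return len(catalog & set(identified_genes)) / len(catalog)


# Toy data, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(round(f1_score(tp, fp, fn), 3), round(mcc(tp, fp, tn, fn), 3))
print(gene_identification_ratio({"A", "B", "C"}, {"B", "C", "D", "E"}))
```

Under this reading, a mean per-phenotype ratio would be the average of `gene_identification_ratio` over the phenotypes analysed.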

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 3%, 10.2%
2. BMC Medical Genomics: 36 papers in training set, Top 0.1%, 9.9%
3. BioData Mining: 15 papers in training set, Top 0.1%, 8.2%
4. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 0.2%, 6.2%
5. Scientific Reports: 3102 papers in training set, Top 20%, 6.2%
6. BMC Bioinformatics: 383 papers in training set, Top 2%, 4.8%
7. Briefings in Bioinformatics: 326 papers in training set, Top 1%, 4.2%
8. Frontiers in Genetics: 197 papers in training set, Top 2%, 3.8%

50% of probability mass above

9. Database: 51 papers in training set, Top 0.2%, 3.5%
10. PLOS ONE: 4510 papers in training set, Top 41%, 3.5%
11. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 2%, 3.5%
12. Journal of Biomedical Informatics: 45 papers in training set, Top 0.7%, 2.0%
13. PLOS Computational Biology: 1633 papers in training set, Top 15%, 1.8%
14. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set, Top 0.2%, 1.8%
15. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.2%, 1.7%
16. Human Genetics: 25 papers in training set, Top 0.2%, 1.7%
17. NAR Genomics and Bioinformatics: 214 papers in training set, Top 2%, 1.6%
18. Genome Medicine: 154 papers in training set, Top 5%, 1.6%
19. Journal of Personalized Medicine: 28 papers in training set, Top 0.5%, 1.5%
20. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 6%, 1.3%
21. International Journal of Molecular Sciences: 453 papers in training set, Top 11%, 1.2%
22. Frontiers in Molecular Biosciences: 100 papers in training set, Top 3%, 0.9%
23. Computers in Biology and Medicine: 120 papers in training set, Top 4%, 0.9%
24. Journal of Genetics and Genomics: 36 papers in training set, Top 2%, 0.8%
25. Nucleic Acids Research: 1128 papers in training set, Top 18%, 0.7%
26. Patterns: 70 papers in training set, Top 3%, 0.7%
27. Bioinformatics Advances: 184 papers in training set, Top 5%, 0.7%
28. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 3%, 0.6%
29. Communications Biology: 886 papers in training set, Top 30%, 0.6%
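The "top 8 journals account for 50% of the predicted probability mass" statement can be checked against the listed percentages with a cumulative sum. The sketch below is an illustration only: the function name `entries_to_reach_mass` is hypothetical, the inputs are the rounded percentages of the ten highest-ranked journals as shown on this page, and the site's own calculation (presumably on unrounded values) is not documented here.

```python
def entries_to_reach_mass(probabilities, threshold=50.0):
    """Return the smallest number of top-ranked entries whose summed
    probability (in percent) reaches the threshold, or None if never reached."""
    total = 0.0
    ranked = sorted(probabilities, reverse=True)
    for count, probability in enumerate(ranked, start=1):
        total += probability
        if total >= threshold:
            return count
    return None


# Rounded percentages for ranks 1-10, copied from the list above.
top_probabilities = [10.2, 9.9, 8.2, 6.2, 6.2, 4.8, 4.2, 3.8, 3.5, 3.5]
print(entries_to_reach_mass(top_probabilities))  # prints 8
```

With these rounded values, the first seven journals sum to about 49.7% and the first eight to about 53.5%, which is consistent with the cut-off being placed after rank 8.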