
PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes

Muneeb, M.; Ascher, D. B.

2026-05-06 bioinformatics
10.64898/2026.05.02.721360 bioRxiv

Motivation: Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models.

Results: We present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation.

Availability and implementation: PhenotypeToGeneDownloaderR is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.

Supplementary information: Supplementary data are available online.
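The validation and recall figures in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `validate_symbols` and `recall`, the toy reference set, and the toy synonym table are all hypothetical, and upper-casing symbols before lookup is a simplifying assumption. Only the four counts in the final lines are taken from the abstract.

```python
from typing import Dict, Iterable, List, Set, Tuple


def validate_symbols(
    symbols: Iterable[str],
    official: Set[str],
    synonyms: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Return (retained, rejected) for a list of retrieved symbols.

    A symbol is retained if it matches an official symbol directly or
    maps to one through the synonym table; retained entries are
    reported under their official symbol.
    """
    retained: List[str] = []
    rejected: List[str] = []
    for raw in symbols:
        key = raw.strip().upper()  # assumption: case-insensitive lookup
        if key in official:
            retained.append(key)            # direct match
        elif key in synonyms:
            retained.append(synonyms[key])  # synonym-based match
        else:
            rejected.append(raw)
    return retained, rejected


def recall(recovered: Set[str], gold: Set[str]) -> float:
    """Fraction of gold-standard genes present in the recovered set."""
    return len(recovered & gold) / len(gold) if gold else 0.0


# Toy data, invented for illustration only.
official = {"BRCA1", "TP53", "MLH1"}
synonyms = {"P53": "TP53", "HMLH1": "MLH1"}
retrieved = ["BRCA1", "p53", "hMLH1", "NOTAGENE"]

retained, rejected = validate_symbols(retrieved, official, synonyms)
toy_validation_rate = len(retained) / len(retrieved)
toy_recall = recall(set(retained), {"BRCA1", "TP53", "ATM"})

# The same ratios applied to the counts reported in the abstract.
reported_validation_pct = round(100 * 100_175 / 114_345, 1)
reported_recall_pct = round(100 * 1_039 / 1_056, 1)
```

On the toy input, three of four symbols are retained and two of three gold-standard genes are recovered. The last two lines reproduce the reported 87.6% validation rate and 98.4% recall from the stated counts.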

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 0.5%; 38.0%
2. Genome Medicine; 154 papers in training set; Top 0.2%; 14.4%
(50% of probability mass above)
3. Nature Communications; 4913 papers in training set; Top 18%; 10.2%
4. Nucleic Acids Research; 1128 papers in training set; Top 4%; 4.4%
5. The American Journal of Human Genetics; 206 papers in training set; Top 1%; 4.3%
6. Nature Genetics; 240 papers in training set; Top 3%; 2.4%
7. Bioinformatics Advances; 184 papers in training set; Top 2%; 2.1%
8. BMC Bioinformatics; 383 papers in training set; Top 4%; 1.7%
9. Genome Biology; 555 papers in training set; Top 4%; 1.7%
10. Database; 51 papers in training set; Top 0.5%; 1.5%
11. European Journal of Human Genetics; 49 papers in training set; Top 0.9%; 1.1%
12. Nature Medicine; 117 papers in training set; Top 4%; 0.9%
13. Scientific Reports; 3102 papers in training set; Top 70%; 0.9%
14. Cell Genomics; 162 papers in training set; Top 6%; 0.8%
15. PLOS Computational Biology; 1633 papers in training set; Top 23%; 0.8%
16. Nature; 575 papers in training set; Top 16%; 0.8%
17. Genetics in Medicine; 69 papers in training set; Top 1.0%; 0.8%
18. PLOS ONE; 4510 papers in training set; Top 69%; 0.7%
19. PLOS Genetics; 756 papers in training set; Top 16%; 0.7%
20. Nature Methods; 336 papers in training set; Top 6%; 0.7%
21. BioData Mining; 15 papers in training set; Top 1%; 0.6%
22. npj Digital Medicine; 97 papers in training set; Top 4%; 0.6%
23. JCO Clinical Cancer Informatics; 18 papers in training set; Top 1.0%; 0.6%
24. npj Genomic Medicine; 33 papers in training set; Top 1%; 0.5%
25. Briefings in Bioinformatics; 326 papers in training set; Top 8%; 0.5%