
PAVS: A Standardized Database of Phenotype-Associated Variants from Saudi Arabian Rare Disease Patients

Abdelhakim, M.; Althagafi, A.; Schofield, P.; Hoehndorf, R.

2026-04-06 genetic and genomic medicine
10.64898/2026.04.05.26350189 medRxiv

Genotype-phenotype databases are essential for variant interpretation and disease gene discovery. Genetic variation differs among human populations, mainly in allele frequencies and haplotype patterns shaped by ancestry and demographic history. Population-specific genotypes can influence traits and disease risk, which makes population-specific characterization important. Most existing resources focus on characterizing a population's genetic background but do not represent the resulting phenotypes. We have developed PAVS (Phenotype-Associated Variants in Saudi Arabia), a curated, publicly accessible database that integrates 5,132 Saudi clinical cases from four Saudi cohorts and 522 cases from analysis of a mixed-population cohort, together with 1,856 cases from the Deciphering Developmental Disorders (DDD) study and 9,588 literature phenopackets. Each case record describes patient-level phenotypes, encoded with the Human Phenotype Ontology (HPO), and links them to genomic variants, gene identifiers, zygosity, pathogenicity classifications, and disease diagnoses mapped to standardized disease terminologies. The data is represented in Phenopackets format and as a knowledge graph in RDF. Additionally, a web interface provides phenotype-based similarity search, gene and variant browsers, and an HPO hierarchy explorer. We evaluate the utility of the phenotype annotations for gene prioritization using semantic similarity. Although PAVS differs clearly from global literature-curated databases, its phenotypes rank the correct gene highly (ROC AUC: 0.89). PAVS addresses a gap in population-specific genotype-phenotype resources and provides a benchmark for phenotype-driven variant prioritization in under-represented populations.
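The abstract does not say which semantic similarity measure is used for gene prioritization. As a minimal sketch of the general idea only, the example below ranks genes by Jaccard similarity between ancestor-closed phenotype term sets. The ontology, term names, gene names, and the choice of Jaccard are all hypothetical stand-ins chosen for illustration; they are not real HPO identifiers and not the authors' method.

```python
# Toy is-a hierarchy (child -> parents). Term names are hypothetical
# placeholders, not real HPO identifiers.
PARENTS = {
    "root": [],
    "neuro": ["root"],
    "seizure": ["neuro"],
    "intellectual_disability": ["neuro"],
    "skeletal": ["root"],
    "short_stature": ["skeletal"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closure(terms):
    """Ancestor-closed set for a list of terms."""
    closed = set()
    for term in terms:
        closed |= ancestors(term)
    return closed


def jaccard(terms_a, terms_b):
    """Jaccard similarity of two ancestor-closed term sets (assumed measure)."""
    a, b = closure(terms_a), closure(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(patient_terms, gene_annotations):
    """Sort genes by similarity to the patient's phenotypes, best first."""
    scores = {
        gene: jaccard(patient_terms, terms)
        for gene, terms in gene_annotations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical gene-to-phenotype annotations.
GENE_ANNOTATIONS = {
    "GENE_A": ["seizure", "intellectual_disability"],
    "GENE_B": ["short_stature"],
    "GENE_C": ["seizure", "short_stature"],
}

ranking = rank_genes(["seizure"], GENE_ANNOTATIONS)
for gene, score in ranking:
    print(f"{gene}\t{score:.2f}")
```

Closing each term set under its ancestors lets related but non-identical phenotypes contribute to the score. An evaluation like the one reported would then measure how highly the known causal gene ranks across many cases.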

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Genome Medicine: 154 papers in training set, Top 0.1%, 23.1%
2. Genetics in Medicine: 69 papers in training set, Top 0.1%, 15.1%
3. Human Mutation: 29 papers in training set, Top 0.1%, 10.3%
4. European Journal of Human Genetics: 49 papers in training set, Top 0.1%, 10.3%
(50% of probability mass above)
5. Nucleic Acids Research: 1128 papers in training set, Top 5%, 3.8%
6. npj Genomic Medicine: 33 papers in training set, Top 0.1%, 3.7%
7. Scientific Reports: 3102 papers in training set, Top 43%, 2.8%
8. Bioinformatics: 1061 papers in training set, Top 6%, 2.7%
9. Frontiers in Genetics: 197 papers in training set, Top 5%, 1.7%
10. Human Genetics: 25 papers in training set, Top 0.1%, 1.7%
11. The American Journal of Human Genetics: 206 papers in training set, Top 2%, 1.5%
12. BMC Bioinformatics: 383 papers in training set, Top 5%, 1.5%
13. Nature Genetics: 240 papers in training set, Top 5%, 1.4%
14. Journal of Medical Genetics: 28 papers in training set, Top 0.3%, 1.4%
15. Scientific Data: 174 papers in training set, Top 1%, 1.3%
16. Nature Communications: 4913 papers in training set, Top 58%, 1.0%
17. Nature Medicine: 117 papers in training set, Top 4%, 0.9%
18. Data in Brief: 13 papers in training set, Top 0.3%, 0.8%
19. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 6%, 0.8%
20. PLOS ONE: 4510 papers in training set, Top 67%, 0.8%
21. GENETICS: 189 papers in training set, Top 1%, 0.7%
22. PLOS Computational Biology: 1633 papers in training set, Top 25%, 0.7%
23. Cell Genomics: 162 papers in training set, Top 7%, 0.7%
24. BMC Medical Genomics: 36 papers in training set, Top 2%, 0.7%
25. Human Genetics and Genomics Advances: 70 papers in training set, Top 1.0%, 0.7%
26. eBioMedicine: 130 papers in training set, Top 5%, 0.7%
27. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 12%, 0.5%
28. Human Genomics: 21 papers in training set, Top 0.5%, 0.5%
29. Bioinformatics Advances: 184 papers in training set, Top 5%, 0.5%