PhenoSS: Phenotype semantic similarity-based approach for rare disease prediction and patient clustering

Chen, S.; Nguyen, Q. M.; Hu, Y.; Liu, C.; Weng, C.; Wang, K.

2026-03-02 health informatics
10.64898/2026.02.26.26347219 medRxiv

Objective: Systematic clinical phenotyping using the Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization methods, which rank candidate diseases from a patient's HPO terms, face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We aim to develop a scalable and statistically principled framework to address these limitations for rare disease prediction and patient stratification.

Methods: We developed PhenoSS, a Gaussian copula-based framework that models disease-specific marginal prevalence of HPO terms while capturing their joint dependencies through a multivariate normal distribution. Phenotype frequencies were estimated using external curated resources, including OARD (Open Annotations for Rare Diseases) and HPO annotations. PhenoSS supports both pairwise phenotype similarity calculation for patient clustering and posterior odds estimation for patient-specific disease prioritization. A batch-effect correction module mitigates systematic phenotyping differences across datasets.

Results: Across diverse simulation scenarios, PhenoSS demonstrated robust disease-prediction performance and consistently improved accuracy after batch-effect correction. In real electronic health record (EHR) data, PhenoSS identified clinically meaningful patient clusters and effectively distinguished patients with different rare diseases. In disease prioritization tasks, PhenoSS achieved performance competitive with existing methods, particularly for patients exhibiting sparse or noisy phenotype annotations.

Conclusion: PhenoSS provides a statistically interpretable framework for modeling phenotypic heterogeneity in rare disease research and is adaptable to other structured clinical vocabularies such as SNOMED-CT and ICD codes.
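
The Gaussian copula idea described in the methods can be illustrated with a minimal sketch. This is not the authors' implementation, and it does not reproduce their frequency estimation, posterior odds, or batch-effect correction. It only shows the general technique: each binary phenotype term is tied to a latent standard normal variable, the latent variables are correlated through a multivariate normal, and a per-term threshold is chosen so that the marginal prevalence is preserved. The function names, the three prevalence values, and the correlation matrix are hypothetical and chosen for illustration only.

```python
import math
import random
from statistics import NormalDist


def cholesky(corr):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(corr)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(corr[i][i] - s)
            else:
                L[i][j] = (corr[i][j] - s) / L[j][j]
    return L


def sample_profiles(prevalence, corr, n_patients, seed=0):
    """Draw binary term indicators from a Gaussian copula (illustrative only).

    prevalence: marginal probability of each term, each strictly in (0, 1).
    corr: correlation matrix of the latent normal variables.
    """
    rng = random.Random(seed)
    L = cholesky(corr)
    n_terms = len(prevalence)
    # Term j is present when its latent value is below Phi^{-1}(p_j),
    # which makes P(term j present) equal to p_j.
    thresholds = [NormalDist().inv_cdf(p) for p in prevalence]
    profiles = []
    for _ in range(n_patients):
        e = [rng.gauss(0.0, 1.0) for _ in range(n_terms)]
        # Correlate the independent draws: z = L e
        z = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(n_terms)]
        profiles.append([1 if z[i] < thresholds[i] else 0 for i in range(n_terms)])
    return profiles


# Hypothetical prevalences and latent correlations for three terms.
prevalence = [0.8, 0.3, 0.1]
corr = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 0.2],
        [0.0, 0.2, 1.0]]
profiles = sample_profiles(prevalence, corr, n_patients=20000)
empirical = [sum(col) / len(profiles) for col in zip(*profiles)]
print(empirical)  # each entry should be close to the matching prevalence
```

In this construction the marginals and the dependence structure are specified separately, which is the property that makes a copula convenient when term prevalences come from one source and correlations from another.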

Matching journals

The top 3 journals together account for more than 50% of the predicted probability mass (56.3% combined).

1. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.1% | 28.7%
2. Journal of Biomedical Informatics | 45 papers in training set | Top 0.1% | 20.2%
3. Bioinformatics | 1061 papers in training set | Top 3% | 7.4%
50% of probability mass above
4. JAMIA Open | 37 papers in training set | Top 0.2% | 5.0%
5. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.6% | 4.5%
6. BMC Bioinformatics | 383 papers in training set | Top 3% | 3.2%
7. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.3% | 2.2%
8. Scientific Reports | 3102 papers in training set | Top 52% | 2.0%
9. Med | 38 papers in training set | Top 0.2% | 1.9%
10. International Journal of Medical Informatics | 25 papers in training set | Top 0.8% | 1.8%
11. npj Digital Medicine | 97 papers in training set | Top 2% | 1.7%
12. JMIR Medical Informatics | 17 papers in training set | Top 0.9% | 1.4%
13. GENETICS | 189 papers in training set | Top 0.9% | 1.3%
14. PLOS ONE | 4510 papers in training set | Top 61% | 1.1%
15. European Journal of Epidemiology | 40 papers in training set | Top 0.5% | 1.0%
16. eBioMedicine | 130 papers in training set | Top 3% | 0.9%
17. BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.8%
18. Nature Communications | 4913 papers in training set | Top 60% | 0.8%
19. PLOS Digital Health | 91 papers in training set | Top 2% | 0.8%
20. The Lancet Digital Health | 25 papers in training set | Top 0.9% | 0.8%
21. iScience | 1063 papers in training set | Top 30% | 0.8%
22. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.9% | 0.8%
23. Journal of Medical Internet Research | 85 papers in training set | Top 5% | 0.7%
24. GigaScience | 172 papers in training set | Top 3% | 0.7%
25. Nature Medicine | 117 papers in training set | Top 5% | 0.7%
26. Genetics in Medicine | 69 papers in training set | Top 1% | 0.7%
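
The position of the 50% marker in the ranking can be checked directly from the listed probabilities. The short sketch below accumulates the first five listed values until a target is reached; the function name `ranks_to_reach` is hypothetical and is not part of the site that produced the ranking.

```python
# First five predicted probabilities from the ranking, in percent.
probs = [28.7, 20.2, 7.4, 5.0, 4.5]


def ranks_to_reach(probs, target=50.0):
    """Return (rank, cumulative) at the first rank where the running sum
    reaches the target, or (None, total) if it is never reached."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return rank, total
    return None, total


rank, cumulative = ranks_to_reach(probs)
print(rank, round(cumulative, 1))  # 3 56.3
```

The top two journals sum to 48.9%, so the third entry is the one that carries the cumulative total past 50%.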