Solving the Diagnostic Odyssey with Synthetic Phenotype Data

Colangelo, G.; Marti, M.

2026-03-23 bioinformatics
10.64898/2026.03.19.712946 bioRxiv
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
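The sampling idea in the abstract can be illustrated with a toy sketch. It is not the GraPhens implementation: the gene names, term IDs, depths, and the `sample_case` function are hypothetical placeholders. It assumes the count prior can be approximated by a clipped Gaussian and the specificity prior by weights that grow exponentially with a term's depth in the ontology.

```python
import math
import random

# Hypothetical toy data: gene -> {term ID: depth in the ontology}.
# These are placeholders, not real HPO annotations.
GENE_TERMS = {
    "GENE_A": {"T1": 2, "T2": 4, "T3": 6, "T4": 7, "T5": 3},
    "GENE_B": {"T2": 4, "T6": 5, "T7": 8, "T8": 2},
}


def sample_case(gene, rng, mean_count=3.0, specificity_weight=0.5):
    """Draw one synthetic (phenotype list, gene) pair for the given gene."""
    depths = dict(GENE_TERMS[gene])

    # Soft prior 1 (assumed form): number of observed phenotypes,
    # drawn from a Gaussian and clipped to [1, number of available terms].
    k = max(1, min(len(depths), round(rng.gauss(mean_count, 1.0))))

    # Soft prior 2 (assumed form): deeper, more specific terms are more
    # likely, but shallow terms keep a non-zero chance of being picked.
    chosen = []
    for _ in range(k):
        pool = list(depths)
        weights = [math.exp(specificity_weight * depths[t]) for t in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        del depths[pick]  # sample without replacement
    return sorted(chosen), gene


rng = random.Random(0)
phenotypes, gene = sample_case("GENE_A", rng)
print(gene, phenotypes)
```

Both priors are soft in the sense that they bias the draw without forbidding any term annotated to the gene, so repeated calls yield varied profiles that stay within the gene's own term set.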

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank. Journal | Papers in training set | Percentile | Predicted probability

1. Nature Machine Intelligence | 61 papers in training set | Top 0.1% | 14.0%
2. Bioinformatics | 1061 papers in training set | Top 2% | 12.1%
3. Bioinformatics Advances | 184 papers in training set | Top 0.6% | 6.2%
4. Scientific Reports | 3102 papers in training set | Top 20% | 6.2%
5. PLOS Computational Biology | 1633 papers in training set | Top 7% | 4.7%
6. Nature Communications | 4913 papers in training set | Top 34% | 4.7%
7. npj Digital Medicine | 97 papers in training set | Top 1% | 3.6%

50% of probability mass above

8. Cell Systems | 167 papers in training set | Top 4% | 3.5%
9. PLOS ONE | 4510 papers in training set | Top 41% | 3.5%
10. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.1% | 2.5%
11. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 28% | 2.0%
12. Genome Research | 409 papers in training set | Top 2% | 1.8%
13. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.9% | 1.8%
14. Nature Medicine | 117 papers in training set | Top 2% | 1.8%
15. Frontiers in Genetics | 197 papers in training set | Top 4% | 1.8%
16. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.7%
17. Genome Medicine | 154 papers in training set | Top 5% | 1.5%
18. iScience | 1063 papers in training set | Top 18% | 1.5%
19. eLife | 5422 papers in training set | Top 48% | 1.3%
20. European Journal of Human Genetics | 49 papers in training set | Top 0.9% | 1.2%
21. Science Advances | 1098 papers in training set | Top 27% | 0.9%
22. Frontiers in Computational Neuroscience | 53 papers in training set | Top 2% | 0.8%
23. Communications Biology | 886 papers in training set | Top 22% | 0.8%
24. Genome Biology | 555 papers in training set | Top 8% | 0.7%
25. npj Systems Biology and Applications | 99 papers in training set | Top 3% | 0.7%
26. BioData Mining | 15 papers in training set | Top 1% | 0.7%
27. Nature | 575 papers in training set | Top 16% | 0.7%
28. JCO Clinical Cancer Informatics | 18 papers in training set | Top 1% | 0.6%
29. Patterns | 70 papers in training set | Top 3% | 0.6%
30. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 11% | 0.6%
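
The 50% cutoff stated above the list can be checked by accumulating the listed probabilities in rank order. The sketch below uses the ten highest values as displayed; since those are rounded to one decimal place, the running total is approximate. The function name `journals_to_reach` is an illustrative choice, not part of the site.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the list above.
probs = [14.0, 12.1, 6.2, 6.2, 4.7, 4.7, 3.6, 3.5, 3.5, 2.5]


def journals_to_reach(probs, threshold):
    """Return the smallest rank whose cumulative probability meets the threshold."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None  # threshold not reached by the supplied values


print(journals_to_reach(probs, 50.0))  # prints 7
```

The cumulative total after six journals is about 47.9%, and the seventh brings it to about 51.5%, which matches the position of the 50% marker.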