Solving the Diagnostic Odyssey with Synthetic Phenotype Data

Colangelo, G.; Marti, M.

2026-03-23 bioinformatics

10.64898/2026.03.19.712946 bioRxiv

The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
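The abstract describes the generator only at a high level, so the following is a minimal sketch of the general idea and not the GraPhens implementation. It samples a synthetic phenotype-gene pair from a gene-local neighborhood of a toy ontology, using one soft prior on the number of phenotypes and one on specificity (approximated here by term depth). The toy ontology, the gene names, the function names, the Gaussian count prior, and the exponential depth weighting are all assumptions made for illustration.

```python
import math
import random

# Hypothetical toy ontology, child -> parent. The real HPO is a DAG in which
# a term may have several parents; a tree keeps the sketch short.
PARENT = {
    "Phenotypic abnormality": None,
    "Abnormal nervous system physiology": "Phenotypic abnormality",
    "Seizure": "Abnormal nervous system physiology",
    "Focal seizure": "Seizure",
    "Abnormal muscle physiology": "Phenotypic abnormality",
    "Hypotonia": "Abnormal muscle physiology",
}

# Hypothetical gene-to-phenotype annotations (placeholder gene names).
GENE_TERMS = {
    "GENE_A": ["Focal seizure", "Hypotonia"],
    "GENE_B": ["Hypotonia"],
}


def ancestors(term):
    """Return the path from `term` up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Number of edges between `term` and the root (a crude specificity proxy)."""
    return len(ancestors(term)) - 1


def gene_local_terms(gene):
    """Annotated terms plus their ancestors, excluding the uninformative root."""
    local = set()
    for term in GENE_TERMS[gene]:
        local.update(t for t in ancestors(term) if PARENT[t] is not None)
    return local


def sample_case(gene, rng, mean_count=3.0, specificity_weight=1.0):
    """Sample one synthetic (gene, phenotypes) pair. All priors are assumed."""
    pool = sorted(gene_local_terms(gene))
    # Soft prior 1 (assumed form): number of observed phenotypes per case.
    k = max(1, min(len(pool), round(rng.gauss(mean_count, 1.0))))
    # Soft prior 2 (assumed form): deeper, more specific terms are favored.
    weights = {t: math.exp(specificity_weight * depth(t)) for t in pool}
    chosen = []
    for _ in range(k):  # weighted sampling without replacement
        candidates = [t for t in pool if t not in chosen]
        w = [weights[t] for t in candidates]
        chosen.append(rng.choices(candidates, weights=w, k=1)[0])
    return gene, sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    for gene in GENE_TERMS:
        print(sample_case(gene, rng))
```

Restricting sampling to the gene-local neighborhood is what keeps such cases inside the plausible region of phenotype space in this sketch, while the two priors shape how many terms appear and how specific they are.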

Matching journals

● Non-profit ◐ University press ○ Commercial

The top 7 journals account for 50% of the predicted probability mass.

Nature Machine Intelligence

○ 61 papers in training set

◐ 1061 papers in training set

Bioinformatics Advances

◐ 184 papers in training set

Scientific Reports

○ 3102 papers in training set

PLOS Computational Biology

● 1633 papers in training set

Nature Communications

○ 4913 papers in training set

npj Digital Medicine

○ 97 papers in training set

50% of probability mass above

○ 167 papers in training set

● 4510 papers in training set

IEEE Transactions on Computational Biology and Bioinformatics

● 17 papers in training set

Proceedings of the National Academy of Sciences

● 2130 papers in training set

Genome Research

● 409 papers in training set

IEEE Journal of Biomedical and Health Informatics

● 34 papers in training set

Nature Medicine

○ 117 papers in training set

Frontiers in Genetics

○ 197 papers in training set

BMC Bioinformatics

○ 383 papers in training set

Genome Medicine

○ 154 papers in training set

○ 1063 papers in training set

● 5422 papers in training set

European Journal of Human Genetics

○ 49 papers in training set

Science Advances

● 1098 papers in training set

Frontiers in Computational Neuroscience

○ 53 papers in training set

Communications Biology

○ 886 papers in training set

○ 555 papers in training set

npj Systems Biology and Applications

○ 99 papers in training set

○ 15 papers in training set

○ 575 papers in training set

JCO Clinical Cancer Informatics

● 18 papers in training set

○ 70 papers in training set

Computational and Structural Biotechnology Journal

● 216 papers in training set