
RD-Embed: Unified representations of rare-disease knowledge from clinical records

Groza, T.; Tan, F.; Lim, N. T. R.; Shanmugasundar, M. W.; Kappaganthu, J.; Lieviant, J. A.; Karnani, N.; Chen, H.; Wong, T. Y.; Jamuar, S. S.

2026-04-04 genetic and genomic medicine
10.64898/2026.04.02.26350083 medRxiv

Rare diseases often present with incomplete, evolving symptoms and signs scattered across clinical notes and coded records, making diagnosis and gene discovery difficult even when genomic data are available. Existing approaches either depend on curated phenotype profiles or use general biomedical language models that are not aligned to rare-disease knowledge, limiting performance in early or ambiguous clinical presentations. Here, we show that RD-Embed, a three-stage representation framework that builds a base space preserving domain knowledge, aligns clinical text and SNOMED-derived signals, and refines relationships with graph-based learning, enables robust rare-disease retrieval from heterogeneous clinical records. Across ten rare-disease datasets, RD-Embed attains top-ten diagnostic retrieval of up to >50% using combined text and phenotype features, compared with ~30% on average for other embedding models and similarly sized large language models. On an EHR stress test, clinical alignment substantially improves text-based retrieval compared with ontology-only representations, supporting use in routine EHR data. We suggest that RD-Embed is a lightweight model that can be incorporated into existing hospital systems to support rare-disease identification, diagnosis, and gene prioritization.
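The abstract reports results as top-ten diagnostic retrieval. As a rough illustration of how such a metric is commonly computed over embeddings, the sketch below ranks candidate disease vectors by cosine similarity to each patient vector and counts a hit when the true diagnosis is among the first k. This is not the authors' evaluation code: the function names, the use of cosine similarity, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_hit_rate(query_vecs, disease_vecs, true_idx, k=10):
    """Fraction of queries whose true disease index is among the k most similar candidates.

    Hypothetical helper: query_vecs stands in for patient embeddings, disease_vecs
    for candidate disease embeddings, true_idx for the index of the correct disease.
    """
    hits = 0
    for query, target in zip(query_vecs, true_idx):
        sims = [cosine(query, d) for d in disease_vecs]
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (invented): four candidate diseases, three patient vectors.
diseases = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
patients = [[1.0, 0.1, 0, 0], [0, 1.0, 0.1, 0], [0, 0, 0.1, 1.0]]
rate_at_1 = top_k_hit_rate(patients, diseases, true_idx=[0, 1, 3], k=1)
```

With k=10 and a realistic candidate set, the same function would give a top-ten retrieval rate of the kind quoted in the abstract, under the assumptions stated above.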

Matching journals

The top 3 journals account for 50% of the predicted probability mass.
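The 50% cutoff can be checked by summing the listed probabilities in rank order until the running total reaches the threshold. The sketch below does this for the five highest-ranked probabilities shown on this page; the function name is hypothetical, and how the page itself computes the cutoff is not stated.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at which the running total first meets the threshold."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    # Threshold never reached with the probabilities supplied.
    return None, total


# Predicted probabilities (percent) for the five highest-ranked journals listed here.
top_probs = [23.3, 19.3, 9.5, 5.0, 5.0]
rank, mass = journals_to_reach(top_probs)
```

Here the running totals are 23.3, 42.6, and 52.1, so the threshold is first crossed at the third journal.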

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | npj Digital Medicine | 97 | Top 0.2% | 23.3% |
| 2 | Genome Medicine | 154 | Top 0.1% | 19.3% |
| 3 | Journal of the American Medical Informatics Association | 61 | Top 0.3% | 9.5% |
| 4 | Journal of Biomedical Informatics | 45 | Top 0.3% | 5.0% |
| 5 | Nature Medicine | 117 | Top 0.4% | 5.0% |
| 6 | Nucleic Acids Research | 1128 | Top 5% | 3.7% |
| 7 | Nature Communications | 4913 | Top 42% | 3.2% |
| 8 | Med | 38 | Top 0.1% | 3.2% |
| 9 | Proceedings of the National Academy of Sciences | 2130 | Top 25% | 2.5% |
| 10 | Genetics in Medicine | 69 | Top 0.5% | 2.1% |
| 11 | Bioinformatics | 1061 | Top 7% | 1.8% |
| 12 | Scientific Reports | 3102 | Top 56% | 1.8% |
| 13 | iScience | 1063 | Top 17% | 1.5% |
| 14 | Nature Human Behaviour | 85 | Top 3% | 1.0% |
| 15 | Nature Genetics | 240 | Top 7% | 0.8% |
| 16 | PLOS ONE | 4510 | Top 65% | 0.8% |
| 17 | Cell Systems | 167 | Top 11% | 0.8% |
| 18 | Nature Biomedical Engineering | 42 | Top 2% | 0.8% |
| 19 | European Journal of Human Genetics | 49 | Top 1% | 0.7% |
| 20 | Genome Biology | 555 | Top 8% | 0.7% |
| 21 | Bioinformatics Advances | 184 | Top 5% | 0.7% |
| 22 | Cell Genomics | 162 | Top 7% | 0.7% |
| 23 | eLife | 5422 | Top 63% | 0.5% |

Rows 1 to 3 lie above the 50% probability-mass cutoff.