RD-Embed: Unified representations of rare-disease knowledge from clinical records
Groza, T.; Tan, F.; Lim, N. T. R.; Shanmugasundar, M. W.; Kappaganthu, J.; Lieviant, J. A.; Karnani, N.; Chen, H.; Wong, T. Y.; Jamuar, S. S.
Rare diseases often present with incomplete, evolving symptoms and signs scattered across clinical notes and coded records, making diagnosis and gene discovery difficult even when genomic data are available. Existing approaches either depend on curated phenotype profiles or use general biomedical language models that are not aligned to rare-disease knowledge, which limits performance in early or ambiguous clinical presentations. Here, we show that RD-Embed enables robust rare-disease retrieval from heterogeneous clinical records. RD-Embed is a three-stage representation framework that builds a base space that preserves domain knowledge, aligns clinical text and SNOMED-derived signals, and refines relationships with graph-based learning. Across ten rare-disease datasets, RD-Embed attains top-ten diagnostic retrieval rates above 50% in the best case using combined text and phenotype features, compared with ~30% on average for other embedding models and similarly sized large language models. On an EHR stress test, clinical alignment substantially improves text-based retrieval compared with ontology-only representations, supporting use in routine EHR data. We suggest that RD-Embed is a lightweight model that can be incorporated into existing hospital systems to support rare-disease identification, diagnosis, and gene prioritization.
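To make the top-ten retrieval metric concrete, the following is a minimal sketch of how a top-k hit rate could be computed over embedding vectors. It is not the RD-Embed implementation: the cosine-similarity ranking, the function names (`cosine`, `top_k_diseases`, `top_k_accuracy`), and the toy three-dimensional vectors are all assumptions made for illustration, since the abstract does not specify the similarity measure or the data layout.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_diseases(patient_vec, disease_vecs, k=10):
    """Return the k disease names whose embeddings are most similar to the patient embedding."""
    ranked = sorted(
        disease_vecs,
        key=lambda name: cosine(patient_vec, disease_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def top_k_accuracy(cases, disease_vecs, k=10):
    """Fraction of cases whose true diagnosis appears among the top-k retrieved diseases."""
    if not cases:
        return 0.0
    hits = sum(
        1 for vec, truth in cases if truth in top_k_diseases(vec, disease_vecs, k)
    )
    return hits / len(cases)


# Hypothetical toy data: three disease embeddings and three patient cases.
disease_vecs = {
    "disease_A": [1.0, 0.0, 0.0],
    "disease_B": [0.0, 1.0, 0.0],
    "disease_C": [0.0, 0.0, 1.0],
}
cases = [
    ([0.9, 0.1, 0.0], "disease_A"),
    ([0.1, 0.8, 0.1], "disease_B"),
    ([0.7, 0.2, 0.1], "disease_C"),  # closest to disease_A, so a miss at k=1
]

accuracy_at_1 = top_k_accuracy(cases, disease_vecs, k=1)
accuracy_at_3 = top_k_accuracy(cases, disease_vecs, k=3)
```

In this toy setup, the third case is ranked correctly only once k is large enough to include its true diagnosis, which is the behaviour a top-ten metric is meant to capture for ambiguous presentations.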