Unified Clinical Vocabulary Embeddings for Advancing Precision Medicine
Johnson, R.; Gottlieb, U.; Shaham, G.; Eisen, L.; Waxman, J.; Devons-Sberro, S.; Ginder, C. R.; Hong, P.; Sayeed, R.; Reis, B. Y.; Balicer, R. D.; Dagan, N.; Zitnik, M.
Integrating structured clinical knowledge into artificial intelligence (AI) models remains a major challenge. Medical codes primarily reflect administrative workflows rather than clinical reasoning, limiting AI models' ability to capture true clinical relationships and undermining their generalizability. To address this, we introduce ClinGraph, a clinical knowledge graph that integrates eight EHR-based vocabularies, and ClinVec, a set of 153,166 clinical code embeddings derived from ClinGraph using a graph transformer neural network. ClinVec provides a machine-readable representation of clinical knowledge that captures semantic relationships among diagnoses, medications, laboratory tests, and procedures. Panels of clinicians from multiple institutions evaluated the embeddings across 96 diseases and more than 3,000 clinical codes, confirming their alignment with expert knowledge. In a retrospective analysis of 4.57 million patients from Clalit Health Services, we show that ClinVec supports phenotype risk scoring and stratifies individuals by survival outcomes. We further demonstrate that injecting ClinVec into large language models improves performance on medical question answering, including for region-specific clinical scenarios. ClinVec enables structured clinical knowledge to be injected into predictive and generative AI models, bridging the gap between EHR codes and clinical reasoning.
Matching journals
The top 5 journals account for 50% of the predicted probability mass.