Unified Clinical Vocabulary Embeddings for Advancing Precision Medicine

Johnson, R.; Gottlieb, U.; Shaham, G.; Eisen, L.; Waxman, J.; Devons-Sberro, S.; Ginder, C. R.; Hong, P.; Sayeed, R.; Reis, B. Y.; Balicer, R. D.; Dagan, N.; Zitnik, M.

2024-12-05 health informatics
10.1101/2024.12.03.24318322 medRxiv

Integrating structured clinical knowledge into artificial intelligence (AI) models remains a major challenge. Medical codes primarily reflect administrative workflows rather than clinical reasoning, limiting AI models' ability to capture true clinical relationships and undermining their generalizability. To address this, we introduce ClinGraph, a clinical knowledge graph that integrates eight EHR-based vocabularies, and ClinVec, a set of 153,166 clinical code embeddings derived from ClinGraph using a graph transformer neural network. ClinVec provides a machine-readable representation of clinical knowledge that captures semantic relationships among diagnoses, medications, laboratory tests, and procedures. Panels of clinicians from multiple institutions evaluated the embeddings across 96 diseases and more than 3,000 clinical codes, confirming their alignment with expert knowledge. In a retrospective analysis of 4.57 million patients from Clalit Health Services, we show that ClinVec supports phenotype risk scoring and stratifies individuals by survival outcomes. We further demonstrate that injecting ClinVec into large language models improves performance on medical question answering, including for region-specific clinical scenarios. ClinVec enables structured clinical knowledge to be injected into predictive and generative AI models, bridging the gap between EHR codes and clinical reasoning.

Matching journals

The top 5 journals together account for just over 50% of the predicted probability mass (53.8% by the listed values).

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | npj Digital Medicine | 97 | Top 0.2% | 22.9%
2 | Science Translational Medicine | 111 | Top 0.1% | 8.6%
3 | Nature Medicine | 117 | Top 0.2% | 8.6%
4 | Med | 38 | Top 0.1% | 7.3%
5 | Nature Biomedical Engineering | 42 | Top 0.1% | 6.4%
50% of probability mass above
6 | Nature Communications | 4913 | Top 32% | 4.9%
7 | Journal of the American Medical Informatics Association | 61 | Top 0.7% | 3.7%
8 | Nature Machine Intelligence | 61 | Top 0.8% | 3.7%
9 | Scientific Reports | 3102 | Top 58% | 1.7%
10 | Journal of Biomedical Informatics | 45 | Top 0.9% | 1.5%
11 | Science Advances | 1098 | Top 20% | 1.5%
12 | Science | 429 | Top 16% | 1.4%
13 | The Lancet Digital Health | 25 | Top 0.5% | 1.4%
14 | Communications Medicine | 85 | Top 0.4% | 1.4%
15 | eLife | 5422 | Top 48% | 1.2%
16 | JCO Clinical Cancer Informatics | 18 | Top 0.6% | 1.2%
17 | Bioinformatics | 1061 | Top 8% | 1.1%
18 | PLOS ONE | 4510 | Top 62% | 1.0%
19 | Patterns | 70 | Top 2% | 1.0%
20 | Cell Reports Medicine | 140 | Top 6% | 1.0%
21 | Nature | 575 | Top 13% | 1.0%
22 | Nature Methods | 336 | Top 5% | 0.9%
23 | Advanced Science | 249 | Top 16% | 0.9%
24 | Nature Computational Science | 50 | Top 1% | 0.9%
25 | Cell Systems | 167 | Top 11% | 0.8%
26 | JMIR Medical Informatics | 17 | Top 1% | 0.8%
27 | Cell | 370 | Top 17% | 0.8%
28 | Nature Biotechnology | 147 | Top 8% | 0.7%
29 | Genome Medicine | 154 | Top 10% | 0.5%
30 | iScience | 1063 | Top 40% | 0.5%