
From General-Purpose to Disease-Specific Features: Aligning LLM Embeddings on a Disease-Specific Biomedical Knowledge Graph for Drug Repurposing

Pandey, S.; Talo, M.; Siderovski, D. P.; Sumien, N.; Bozdag, S.

2026-03-10 · bioinformatics
bioRxiv · DOI: 10.64898/2026.03.07.707871
Abstract

Identifying new therapeutic uses for existing drugs is a major challenge in biomedicine, especially for complex neurodegenerative conditions such as Alzheimer disease and related dementias (ADRD), where treatment options remain limited and relevant data are often sparse, heterogeneous, and difficult to integrate. Although general-purpose Large Language Model (LLM) embeddings encode rich semantic information, they often lack the task-specific biomedical context needed for inference tasks such as computational drug repurposing. We introduce Contextualizing LLM Embeddings via Attention-based gRaph learning (CLEAR), a multimodal representation-fusion framework that aligns LLM embeddings with the topological structure of a context-specific Knowledge Graph (KG). Across five benchmark datasets, CLEAR achieved state-of-the-art results, improving predictive performance (e.g., F1 score) by up to 30% over prior methods. We further applied CLEAR to identify FDA-approved drugs with potential for repurposing for ADRD, including Parkinson disease-related dementia and Lewy body dementia. CLEAR learned a biologically coherent embedding space, prioritized leading ADRD drug candidates, and accurately summarized known therapeutic relationships for FDA-approved Alzheimer disease drugs. Overall, CLEAR shows that grounding biomedical LLM embeddings with context-specific KG signals can improve drug repurposing in data-sparse, real-world settings. GitHub: https://github.com/bozdaglab/CLEAR
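
The abstract describes the architecture only at a high level, so the sketch below is a minimal, hypothetical illustration of the general pattern it names: a graph-attention encoder contextualizes pretrained LLM node embeddings with KG topology, the two views are fused, and drug-disease pairs are scored as a link-prediction problem. Everything concrete here (class names, dimensions, the concat-then-project fusion, the bilinear scoring head) is an assumption for illustration, not CLEAR's actual implementation; the authors' code is in the linked GitHub repository.

```python
# Hypothetical sketch, NOT the authors' implementation: contextualize frozen
# LLM embeddings with attention-based message passing over a knowledge graph,
# then score (drug, disease) pairs as a link-prediction task.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head GAT-style attention over an edge list."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim); edge_index: (2, num_edges) as [src, dst] rows
        h = self.proj(x)
        src, dst = edge_index
        # Unnormalized attention logit for every edge
        e = F.leaky_relu(self.attn(torch.cat([h[src], h[dst]], dim=-1)), 0.2).squeeze(-1)
        # Softmax over each destination node's incoming edges
        alpha = torch.zeros_like(e)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = F.softmax(e[mask], dim=0)
        # Aggregate attention-weighted neighbor messages
        out = torch.zeros_like(h)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
        return F.elu(out)


class KGContextualizedScorer(nn.Module):
    """Fuses LLM text embeddings with KG structure; scores (drug, disease) pairs."""

    def __init__(self, llm_dim: int = 768, hid_dim: int = 128):
        super().__init__()
        self.gnn = GraphAttentionLayer(llm_dim, hid_dim)
        self.fuse = nn.Linear(llm_dim + hid_dim, hid_dim)  # concat-then-project fusion
        self.score = nn.Bilinear(hid_dim, hid_dim, 1)      # link-prediction head

    def forward(self, llm_emb, edge_index, drug_idx, disease_idx):
        kg_emb = self.gnn(llm_emb, edge_index)  # topology-aware view of each node
        z = torch.tanh(self.fuse(torch.cat([llm_emb, kg_emb], dim=-1)))
        return torch.sigmoid(self.score(z[drug_idx], z[disease_idx])).squeeze(-1)


# Toy usage: 6 KG nodes with random stand-ins for real LLM embeddings
llm_emb = torch.randn(6, 768)
edge_index = torch.tensor([[0, 1, 2, 3, 4],   # source nodes
                           [1, 2, 3, 4, 5]])  # destination nodes
model = KGContextualizedScorer()
prob = model(llm_emb, edge_index, torch.tensor([0]), torch.tensor([5]))
print(prob)  # predicted probability of a drug-disease association edge
```

In practice one would replace the per-node softmax loop with a batched scatter operation and train with negative sampling over non-edges; the loop is kept here only for readability.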

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Nature Machine Intelligence                         61                      Top 0.1%    14.2%
2     Bioinformatics                                      1061                    Top 3%      10.0%
3     Advanced Science                                    249                     Top 2%      8.3%
4     Nature Communications                               4913                    Top 37%     3.9%
5     npj Digital Medicine                                97                      Top 1%      3.8%
6     Nature Methods                                      336                     Top 3%      3.6%
7     Genome Medicine                                     154                     Top 2%      3.5%
8     Cell Systems                                        167                     Top 4%      3.5%
---- 50% of probability mass above this line ----
9     Nucleic Acids Research                              1128                    Top 6%      3.5%
10    Bioinformatics Advances                             184                     Top 2%      3.2%
11    Briefings in Bioinformatics                         326                     Top 2%      3.0%
12    Scientific Reports                                  3102                    Top 48%     2.3%
13    Proceedings of the National Academy of Sciences     2130                    Top 28%     2.1%
14    Nature Biomedical Engineering                       42                      Top 0.6%    2.1%
15    Patterns                                            70                      Top 0.9%    1.7%
16    Nature Medicine                                     117                     Top 3%      1.3%
17    Computational and Structural Biotechnology Journal  216                     Top 6%      1.2%
18    Genome Biology                                      555                     Top 6%      0.9%
19    Nature Biotechnology                                147                     Top 6%      0.9%
20    PLOS Computational Biology                          1633                    Top 21%     0.9%
21    BMC Bioinformatics                                  383                     Top 6%      0.9%
22    Nature                                              575                     Top 15%     0.8%
23    Communications Biology                              886                     Top 22%     0.8%
24    PLOS ONE                                            4510                    Top 66%     0.8%
25    IEEE Journal of Biomedical and Health Informatics   34                      Top 2%      0.8%
26    eLife                                               5422                    Top 56%     0.8%
27    Journal of Chemical Information and Modeling        207                     Top 3%      0.7%
28    npj Systems Biology and Applications                99                      Top 3%      0.7%
29    GigaScience                                         172                     Top 3%      0.7%
30    NAR Genomics and Bioinformatics                     214                     Top 4%      0.7%