
Drug-drug interaction identification using large language models

Blotske, K.; Zhao, X.; Henry, K.; Gao, Y.; Tilley, A.; Cargile, M.; Murray, B.; Smith, S. E.; Barreto, E.; Bauer, S.; Sohn, S.; Liu, T.; Sikora, A.

medRxiv preprint, 2025-12-04 · pharmacology and therapeutics
DOI: 10.64898/2025.12.03.25341549

Background: Drug-drug interactions (DDIs) are a significant source of morbidity and adverse drug events (ADEs), particularly in polypharmacy and complex medication regimens. While rules-based software integrated into electronic health records (EHRs) has demonstrated proficiency in identifying DDIs in medication regimens, large language model (LLM) based identification requires thorough benchmarking and performance evaluation on high-quality datasets before it can be used safely. The purpose of this study was to develop a series of benchmarking experiments for LLM performance in the identification and management of DDIs, using a purpose-built, clinician-annotated dataset of clinically relevant DDIs.

Methods: We evaluated three LLMs (GPT-4o-mini, MedGemma-27B, LLaMA3-70B) on a clinician-annotated benchmark dataset of 750 DDI scenarios spanning three levels of diagnostic complexity. Tasks were aligned with flexible judgment formats: (1) a pointwise two-drug classification task, (2) a pairwise three-drug discrimination task, and (3) a listwise 4-6 drug selection task. Standardized zero-shot prompting with task-specific instructions was applied to all models. Performance was assessed using precision, recall, F1 score, and accuracy. Reliability was quantified using self-consistency across repeated runs and confidence-aligned metrics to capture stability in model reasoning.

Results: Across the three experiments, model performance varied by task structure and interaction severity. LLaMA3-70B demonstrated the highest recall and F1 score in the pointwise task, whereas GPT-4o-mini achieved superior accuracy and consistency in the pairwise and listwise tasks. MedGemma-27B showed competitive performance in identifying Category D interactions. Self-consistency decreased as task complexity increased, highlighting reduced stability in multi-drug reasoning. No model exhibited uniformly high reliability across all judgment formats.

Conclusions: Current LLMs show promising but uneven capabilities in identifying DDIs across clinically relevant task structures. Performance degrades as the reasoning space expands, and stability across repeated queries remains limited. These findings emphasize the need for multi-format evaluation frameworks and reliability-aware assessment when considering LLMs for medication-safety applications.
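The study's scoring code is not included here, but the metrics named in the Methods are standard. The sketch below shows how precision, recall, F1, accuracy, and a self-consistency score (agreement of repeated runs with the modal answer) are typically computed; the label names and function names are illustrative, not taken from the paper.

```python
from collections import Counter

def classification_metrics(y_true, y_pred, positive="interaction"):
    """Precision, recall, F1, and accuracy for a binary DDI label.

    Labels ('interaction' / 'no-interaction') are hypothetical; the
    paper's exact label scheme is not reproduced in this listing.
    """
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(y_true)
    return precision, recall, f1, accuracy

def self_consistency(answers):
    """Fraction of repeated runs agreeing with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)
```

For example, a model that answers "D" on four of five repeated runs of the same prompt has a self-consistency of `self_consistency(["D", "D", "C", "D", "D"])` = 0.8.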

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank  Journal                                                  Papers in training set  Percentile  Probability
1     BioData Mining                                           15                      Top 0.1%    14.6%
2     Clinical Pharmacology & Therapeutics                     25                      Top 0.1%    14.6%
3     Journal of Biomedical Informatics                        45                      Top 0.1%    14.2%
4     Pharmacoepidemiology and Drug Safety                     13                      Top 0.1%    6.3%
5     npj Digital Medicine                                     97                      Top 0.8%    6.3%
----- 50% of probability mass above this line -----
6     JMIRx Med                                                31                      Top 0.1%    4.5%
7     PLOS ONE                                                 4510                    Top 35%     4.1%
8     Frontiers in Pharmacology                                100                     Top 0.7%    3.9%
9     Clinical and Translational Science                       21                      Top 0.2%    3.6%
10    Computational and Structural Biotechnology Journal       216                     Top 2%      3.6%
11    Journal of the American Medical Informatics Association  61                      Top 0.8%    3.2%
12    BMC Medical Informatics and Decision Making              39                      Top 1%      2.1%
13    British Journal of Clinical Pharmacology                 21                      Top 0.3%    1.9%
14    Scientific Reports                                       3102                    Top 59%     1.7%
15    International Journal of Medical Informatics             25                      Top 1%      1.3%
16    JAMIA Open                                               37                      Top 1%      0.9%
17    Epilepsy Research                                        12                      Top 0.3%    0.7%
18    Journal of Medical Internet Research                     85                      Top 5%      0.7%
19    Heliyon                                                  146                     Top 8%      0.6%
20    JCO Clinical Cancer Informatics                          18                      Top 1.0%    0.6%
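The "50% of probability mass" cutoff is just the smallest rank prefix whose shares sum to at least the threshold: the top five shares (14.6% + 14.6% + 14.2% + 6.3% + 6.3%) reach 56.0%, while the top four stop at 49.7%. A minimal sketch of that calculation (the function name is illustrative, not part of the widget):

```python
def mass_cutoff(probs, threshold=0.5):
    """Return (rank, cumulative mass) at which the ranked probability
    shares first reach the threshold. `probs` is ordered by rank."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return len(probs), total

# Top shares from the table above, as fractions.
shares = [0.146, 0.146, 0.142, 0.063, 0.063, 0.045, 0.041]
```

Running `mass_cutoff(shares)` on these values yields rank 5 with roughly 0.56 of the mass, matching the divider drawn after npj Digital Medicine.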