Relationship Extraction for Adverse Drug Events in Clinical Notes Using Large Language Models

Plasek, J. M.; Li, Y.; Amato, M. G.; Foer, D.; Seger, D. L.; Alzaidi, S.; Zhou, H.; Jackson, G. P.; Bates, D. W.; Zhou, L.

2026-06-01 health informatics
10.64898/2026.05.28.26354362 medRxiv
Background: Adverse drug events (ADEs) are a critical indicator of patient safety but are often documented only in free-text clinical notes. The potential of recent advances in natural language processing (NLP), particularly generative large language models (LLMs), to identify ADEs remains understudied. This study aimed to compare the performance of multiple LLMs in identifying ADE-Drug relationships in inpatient and ambulatory clinical notes.
Methods: We used clinical notes from the 2018 National NLP Clinical Challenge (n2c2) ADE dataset (inpatient; n=505) and from outpatient encounters (n=2,555) between October 1, 2018, and December 31, 2019, at a large academic medical center based in New England. Notes were pre-processed into snippets for model input. The models evaluated were GPT-4o, GPT-4o-mini, and LLAMA 3.3-70B, together with their instruction fine-tuned variants (including low-rank adapters for LLAMA). Performance was assessed for all models using both strict and relaxed evaluations (precision, recall, and F1). The two best-performing models then underwent manual evaluation, with each output labeled as exact semantic match, partial match, missing ADE, drug mention only, not a drug, or wrong.
Results: GPT-4o and GPT-4o-mini were the top-performing models among those evaluated. GPT-4o consistently outperformed GPT-4o-mini in ADE extraction across both datasets, with higher F1-scores (0.524 vs. 0.381) and a more balanced precision-recall profile. Both models captured ADEs effectively in explicit and complex clinical contexts, although limitations included misclassification of pre-existing allergies and occasional conflation of therapeutic indications with adverse effects. GPT-4o achieved higher exact match coverage and fewer errors across clinical notes, indicating more reliable performance in both inpatient and ambulatory settings.
Conclusion: This work establishes a foundation for integrating LLM methods into real-world drug safety surveillance, with direct implications for improving patient safety.
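
The abstract reports precision, recall, and F1 under strict and relaxed evaluation but does not define the two matching rules. The sketch below is only an illustration of how such a comparison could be scored. Everything in it is an assumption rather than the authors' method: the `(drug, ade)` pair representation, the function names, the example pairs, and in particular the relaxed rule (here, any shared token in both the drug and the ADE strings).

```python
from typing import Callable, List, Tuple

# Hypothetical representation of one extracted relationship: (drug, ade).
Pair = Tuple[str, str]


def _tokens(text: str) -> set:
    """Lowercase and split on whitespace."""
    return set(text.lower().split())


def strict_match(pred: Pair, gold: Pair) -> bool:
    """Assumed strict rule: both fields are identical after normalization."""
    return _tokens(pred[0]) == _tokens(gold[0]) and _tokens(pred[1]) == _tokens(gold[1])


def relaxed_match(pred: Pair, gold: Pair) -> bool:
    """Assumed relaxed rule: both fields share at least one token."""
    return bool(_tokens(pred[0]) & _tokens(gold[0])) and bool(
        _tokens(pred[1]) & _tokens(gold[1])
    )


def score(preds: List[Pair], golds: List[Pair],
          match: Callable[[Pair, Pair], bool]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the given matching rule."""
    matched_preds = sum(any(match(p, g) for g in golds) for p in preds)
    matched_golds = sum(any(match(p, g) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example pairs, not taken from either dataset.
gold_pairs = [("lisinopril", "dry cough"), ("warfarin", "GI bleeding")]
pred_pairs = [("Lisinopril", "cough"), ("warfarin", "gi bleeding"), ("aspirin", "rash")]

strict_scores = score(pred_pairs, gold_pairs, strict_match)
relaxed_scores = score(pred_pairs, gold_pairs, relaxed_match)
print("strict  P/R/F1:", strict_scores)
print("relaxed P/R/F1:", relaxed_scores)
```

In this toy example the partially matching prediction ("cough" for "dry cough") counts only under the relaxed rule, which is why relaxed scores are typically at least as high as strict ones.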

Matching journals

The top 4 journals together account for just over 50% (55.8%) of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 23.2%
2 | Journal of Biomedical Informatics | 45 | Top 0.1% | 14.8%
3 | npj Digital Medicine | 97 | Top 0.5% | 10.8%
4 | International Journal of Medical Informatics | 25 | Top 0.1% | 7.0%
(50% of probability mass above)
5 | BMC Medical Informatics and Decision Making | 39 | Top 0.4% | 6.6%
6 | JAMIA Open | 37 | Top 0.2% | 6.5%
7 | JCO Clinical Cancer Informatics | 18 | Top 0.2% | 3.7%
8 | Frontiers in Digital Health | 20 | Top 0.5% | 1.9%
9 | Journal of Medical Internet Research | 85 | Top 2% | 1.8%
10 | JMIR Medical Informatics | 17 | Top 0.7% | 1.7%
11 | Scientific Reports | 3102 | Top 56% | 1.7%
12 | Bioinformatics | 1061 | Top 7% | 1.7%
13 | PLOS ONE | 4510 | Top 57% | 1.4%
14 | The Lancet Digital Health | 25 | Top 0.6% | 1.3%
15 | Med | 38 | Top 0.6% | 0.9%
16 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 0.9%
17 | Computers in Biology and Medicine | 120 | Top 4% | 0.8%
18 | BMC Bioinformatics | 383 | Top 6% | 0.8%
19 | Clinical Pharmacology & Therapeutics | 25 | Top 0.8% | 0.7%
20 | iScience | 1063 | Top 32% | 0.7%
21 | BMC Medical Research Methodology | 43 | Top 2% | 0.7%
22 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 3% | 0.5%
23 | Biology Methods and Protocols | 53 | Top 3% | 0.5%
24 | JAMA Pediatrics | 10 | Top 0.2% | 0.5%
25 | Cureus | 67 | Top 6% | 0.5%
26 | Clinical and Translational Science | 21 | Top 1% | 0.5%
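
The 50% cutoff in the ranking above can be checked with a running sum over the listed probabilities. The following is a minimal sketch of that arithmetic; the function name `journals_to_reach` is hypothetical, and only the six highest probabilities from the list are used.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

# Predicted probabilities (%) of the six top-ranked journals, copied from the list.
top_probs = [23.2, 14.8, 10.8, 7.0, 6.6, 6.5]


def journals_to_reach(probs: List[float], threshold: float) -> Optional[Tuple[int, float]]:
    """Return (rank, cumulative %) at the first rank where the running sum reaches threshold."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, round(total, 1)
    return None  # threshold never reached with the given probabilities


print(journals_to_reach(top_probs, 50.0))
```

The running sum is 48.8% after three journals and 55.8% after four, so the fourth journal is the first rank at which the cumulative mass passes 50%.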