
Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance

Heckmann, N. S.; Papoutsi, D. G.; Barbieri, M. A.; Battini, V.; Molgaard, S. N.; Schmidt, S. O.; Melskens, L.; Sessa, M.

2026-02-24 pharmacology and therapeutics
10.64898/2026.02.19.26346467 medRxiv

Background: Biomedical Large Language Models (LLMs) combined with prompt engineering offer domain-specific reasoning, yet their application to individual-level causality assessment remains unexplored. This study evaluated five combinations of biomedical LLMs, prompting strategies, and causality algorithms by comparing their agreement with two human expert evaluators.

Research design and methods: A total of 150 Individual Case Safety Reports (ICSRs) were analyzed: 140 reports from the Food and Drug Administration Adverse Event Reporting System (FAERS) and 10 myocarditis/pericarditis ICSRs from the Vaccine Adverse Event Reporting System (VAERS). Assessments were conducted using the Naranjo and WHO-UMC algorithms. The biomedical LLMs tested were TinyLlama 1.1B, Medicine LLaMA-3 8B, and MedLLaMA v20, each combined with Chain-of-Thought (CoT) or Decomposition prompting. Agreement was measured using Gwet's Agreement Coefficient 1 (AC1) and percentage agreement, alongside performance metrics and qualitative error analysis.

Results: The Medicine LLaMA-3 8B-Naranjo-CoT combination achieved the highest agreement with human assessors for the final classification of causality (64%). Biomedical LLMs demonstrated low inter-rater agreement on critical items of causality assessment, such as identification of listed adverse events (AEs), temporal plausibility, alternative causes, and objective evidence of AEs. Frequent model failures included irrelevant responses.

Conclusions: Biomedical LLMs showed improved performance over the general-purpose models previously tested but remain suboptimal for reliable causality assessment of ICSRs.
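The two agreement measures named in the abstract can be illustrated with a minimal sketch for the two-rater case. This is not the authors' code: the function names, the example labels, and the choice of the unweighted two-rater form of Gwet's AC1 are assumptions made here for illustration, and the abstract does not say how the study handled multiple raters or weighting. In this form, AC1 = (pa - pe) / (1 - pe), where pa is the observed proportion of agreement and pe = (1 / (K - 1)) * sum over categories of pi_k * (1 - pi_k), with pi_k the mean share of ratings falling in category k across both raters.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of cases on which the two raters assign the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def gwet_ac1(rater_a, rater_b, categories):
    """Unweighted Gwet's AC1 for two raters over a fixed category set."""
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    n = len(rater_a)
    pa = percent_agreement(rater_a, rater_b)
    # pi_k: mean proportion of ratings in category k across both raters.
    counts = Counter(rater_a) + Counter(rater_b)
    pi = [counts[c] / (2 * n) for c in categories]
    # Chance agreement under Gwet's model.
    pe = sum(p * (1 - p) for p in pi) / (k - 1)
    return (pa - pe) / (1 - pe)


# Hypothetical example: four reports rated on the Naranjo categories.
naranjo = ["definite", "probable", "possible", "doubtful"]
human = ["probable", "possible", "possible", "doubtful"]
model = ["probable", "possible", "probable", "doubtful"]

print(percent_agreement(human, model))            # 0.75
print(round(gwet_ac1(human, model, naranjo), 2))  # 0.68
```

Unlike Cohen's kappa, the chance-agreement term here stays bounded by 1/K, which is one reason AC1 is often chosen when category prevalence is highly skewed.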

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Clinical Pharmacology & Therapeutics (25 papers in training set; Top 0.1%; 23.0%)
2. Clinical and Translational Science (21 papers in training set; Top 0.1%; 8.6%)
3. Pharmacoepidemiology and Drug Safety (13 papers in training set; Top 0.1%; 8.6%)
4. Journal of Biomedical Informatics (45 papers in training set; Top 0.2%; 8.6%)
5. Frontiers in Pharmacology (100 papers in training set; Top 0.4%; 6.5%)

50% of probability mass above

6. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.5%; 4.9%)
7. BioData Mining (15 papers in training set; Top 0.1%; 4.4%)
8. npj Digital Medicine (97 papers in training set; Top 1%; 4.0%)
9. PLOS ONE (4510 papers in training set; Top 38%; 3.7%)
10. British Journal of Clinical Pharmacology (21 papers in training set; Top 0.2%; 2.8%)
11. Computational and Structural Biotechnology Journal (216 papers in training set; Top 4%; 1.7%)
12. Scientific Data (174 papers in training set; Top 1%; 1.7%)
13. Scientific Reports (3102 papers in training set; Top 61%; 1.5%)
14. JMIRx Med (31 papers in training set; Top 0.9%; 1.4%)
15. BMJ (49 papers in training set; Top 1.0%; 0.9%)
16. Epilepsy Research (12 papers in training set; Top 0.3%; 0.9%)
17. Frontiers in Medicine (113 papers in training set; Top 6%; 0.8%)
18. Research Synthesis Methods (20 papers in training set; Top 0.2%; 0.8%)
19. JAMA Network Open (127 papers in training set; Top 4%; 0.8%)
20. JCO Clinical Cancer Informatics (18 papers in training set; Top 0.8%; 0.8%)
21. The Lancet Infectious Diseases (71 papers in training set; Top 3%; 0.7%)
22. Heliyon (146 papers in training set; Top 7%; 0.7%)
23. Informatics in Medicine Unlocked (21 papers in training set; Top 1%; 0.7%)
24. Trials (25 papers in training set; Top 2%; 0.7%)