Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance
Heckmann, N. S.; Papoutsi, D. G.; Barbieri, M. A.; Battini, V.; Molgaard, S. N.; Schmidt, S. O.; Melskens, L.; Sessa, M.
Background
Biomedical Large Language Models (LLMs) combined with prompt engineering offer domain-specific reasoning, yet their application to individual-level causality assessment remains unexplored. This study evaluated five combinations of biomedical LLMs, prompting strategies, and causality algorithms by comparing their agreement with two human expert evaluators.

Research design and methods
A total of 150 Individual Case Safety Reports (ICSRs) were analyzed: 140 reports from the Food and Drug Administration Adverse Event Reporting System (FAERS) and 10 myocarditis/pericarditis ICSRs from the Vaccine Adverse Event Reporting System (VAERS). Assessments were conducted using the Naranjo and WHO-UMC algorithms. The biomedical LLMs tested were TinyLlama 1.1B, Medicine LLaMA-3 8B, and MedLLaMA v20, each combined with Chain-of-Thought (CoT) or Decomposition prompting. Agreement was measured using Gwet's Agreement Coefficient 1 (AC1) and percentage agreement, alongside performance metrics and qualitative error analysis.

Results
The Medicine LLaMA-3 8B-Naranjo-CoT combination achieved the highest agreement with human assessors for the final causality classification (64%). Biomedical LLMs demonstrated low inter-rater agreement on critical items of causality assessment, such as identification of listed AEs, temporal plausibility, alternative causes, and objective evidence of AEs. Frequent model failures included irrelevant responses.

Conclusions
Biomedical LLMs showed improved performance over previously tested general-purpose models but remain suboptimal for reliable causality assessment of ICSRs.
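The Naranjo algorithm referenced in the methods is a published 10-item questionnaire whose summed score maps to a final causality category, which is the classification the models and human raters were compared on. Below is a minimal Python sketch of that scoring step, following the published scale (Naranjo et al., 1981); the function name and answer encoding are illustrative and not taken from the paper.

```python
# Minimal sketch of Naranjo ADR probability scoring (Naranjo et al., 1981).
# Each of the 10 items is answered "yes", "no", or "unknown"; the per-answer
# points below follow the published scale, and "unknown" always scores 0.

NARANJO_POINTS = [
    # (points if yes, points if no)
    (+1, 0),   # 1. Previous conclusive reports of this reaction?
    (+2, -1),  # 2. Did the AE appear after the suspected drug was given?
    (+1, 0),   # 3. Did the AE improve on discontinuation/antagonist?
    (+2, -1),  # 4. Did the AE reappear on re-administration?
    (-1, +2),  # 5. Are there alternative causes for the AE?
    (-1, +1),  # 6. Did the reaction reappear with placebo?
    (+1, 0),   # 7. Drug detected in blood at toxic concentrations?
    (+1, 0),   # 8. Was the reaction dose-dependent?
    (+1, 0),   # 9. Similar reaction to the same/similar drugs previously?
    (+1, 0),   # 10. AE confirmed by objective evidence?
]

def naranjo_category(answers: list[str]) -> tuple[int, str]:
    """Sum item scores and map the total to the Naranjo causality category."""
    assert len(answers) == 10
    score = sum(
        yes if a == "yes" else no if a == "no" else 0
        for a, (yes, no) in zip(answers, NARANJO_POINTS)
    )
    if score >= 9:
        category = "definite"
    elif score >= 5:
        category = "probable"
    elif score >= 1:
        category = "possible"
    else:
        category = "doubtful"
    return score, category

# Example: a hypothetical ICSR with positive temporal, dechallenge,
# and objective evidence, and no plausible alternative cause.
print(naranjo_category(
    ["unknown", "yes", "yes", "unknown", "no",
     "unknown", "unknown", "unknown", "unknown", "yes"]
))  # -> (6, 'probable')
```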
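Agreement in the results is reported with Gwet's AC1, which corrects observed agreement for chance agreement estimated from average category prevalences. A minimal sketch of the two-rater form is below, assuming the standard AC1 definition; the function name and example labels are illustrative.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's first-order agreement coefficient (AC1) for two raters.

    ratings_a, ratings_b: equal-length sequences of categorical labels,
    e.g. causality categories assigned by a human expert and an LLM.
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    if q < 2:
        return 1.0  # degenerate case: both raters used a single category

    # Observed agreement: fraction of items rated identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: based on the average marginal proportion pi_k
    # of each category across the two raters.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(
        (pi_k := (counts_a[k] + counts_b[k]) / (2 * n)) * (1 - pi_k)
        for k in categories
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Example with hypothetical expert vs. model labels:
expert = ["probable", "possible", "probable", "unlikely", "possible"]
model  = ["probable", "probable", "probable", "unlikely", "possible"]
print(round(gwet_ac1(expert, model), 3))  # -> 0.71
```

Unlike Cohen's kappa, AC1 stays stable when one category dominates the sample, which is why it is often preferred for pharmacovigilance data where most reports fall into a few causality classes.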
Matching journals
The top 5 journals account for 50% of the predicted probability mass.