Extracting patient reported cannabis use and reasons for use from electronic health records: a benchmarking study of large language models
Wang, Y.; Bozkurt, S.; Le, N.; Alagappan, A.; Huang, C.; Rajwal, S.; Lewis, A.; Kim, J.; Falasinnu, T.
Objective: To develop and evaluate a scalable, reproducible natural language processing (NLP) approach using large language models (LLMs) to identify cannabis use status and reasons for cannabis use among patients with autoimmune rheumatic diseases (ARDs) from unstructured electronic health record (EHR) clinical notes.
Methods and Analysis: We conducted a retrospective study using EHR clinical notes from patients with ARDs (2015-2024). Notes were screened for cannabis-related mentions using fuzzy string matching against a curated keyword lexicon with a similarity threshold of 90, extracting 50-word context windows (±25 words). Two domain experts annotated 886 randomly sampled snippets across four classes: (1) not a true cannabis mention/uncertain, (2) denial of use, (3) positive past use, and (4) positive current use. Using these annotations, we compared multiple LLM prompting strategies (zero-shot to few-shot; temperature tuning) and a fine-tuned clinical model (GatorTron 345M). For "reason for use," 1,027 snippets were annotated into six categories: pain, nausea, sleep, anxiety/stress/mood, appetite, and not mentioned/unknown. Models were evaluated on a held-out validation set using accuracy, F1, recall, and precision. We then aggregated snippet-level predictions to the patient level to describe temporal trends and subgroup differences.
Results: For cannabis use status classification, the fine-tuned GatorTron model achieved the highest performance (accuracy 0.90; F1 0.91; recall 0.90; precision 0.90). For reason-for-use classification, gpt-oss-20B achieved the highest performance (accuracy 0.77; F1 0.77; recall 0.77; precision 0.86). Patient-level analyses characterized trends in documented cannabis use from 2015-2024 and compared clinical characteristics between current users and patients denying use.
Conclusion: High-precision extraction of cannabis use status and reasons for use from EHR notes is feasible using a combination of fine-tuned clinical language models and LLM-based classifiers. This approach enables scalable measurement of patient-reported symptom self-management strategies in ARDs, supporting observational research and potential clinical decision support.
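The screening step described in the methods (fuzzy keyword matching at a similarity threshold of 90, followed by extraction of a context window of 25 words on each side) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not name the matching library or list the lexicon, so the sketch uses Python's standard-library `difflib.SequenceMatcher` as a stand-in similarity measure, a hypothetical five-term `LEXICON`, and a hypothetical helper `extract_snippets`. It also keeps the matched word itself, so a full window holds 51 words.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lexicon: the study's curated keyword list is not given in the abstract.
LEXICON = ["cannabis", "marijuana", "cbd", "thc", "edibles"]
THRESHOLD = 90    # similarity on a 0-100 scale, as stated in the abstract
HALF_WINDOW = 25  # words kept on each side of the matched word


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score (0-100); the study's actual scorer is unknown."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_snippets(note: str) -> list[str]:
    """Return one context window per word that fuzzily matches a lexicon term."""
    words = note.split()
    snippets = []
    for i, word in enumerate(words):
        # Normalize: lowercase and drop punctuation and digits before matching.
        token = re.sub(r"[^a-z]", "", word.lower())
        if not token:
            continue
        if any(similarity(token, kw) >= THRESHOLD for kw in LEXICON):
            start = max(0, i - HALF_WINDOW)
            end = min(len(words), i + HALF_WINDOW + 1)
            snippets.append(" ".join(words[start:end]))
    return snippets


if __name__ == "__main__":
    note = "Patient reports using marijuanna nightly for sleep."
    for snippet in extract_snippets(note):
        print(snippet)
```

Fuzzy matching lets the screen catch misspellings such as "marijuanna" that exact keyword search would miss, while the threshold keeps short common words from matching short lexicon terms. Windows are truncated at the start and end of a note, so snippets near the boundaries are shorter.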
Matching journals
The top 7 journals account for 50% of the predicted probability mass.