Cannabis Use Documentation within the Electronic Health Record: A Use Case for Natural Language Processing Methods

Pradhan, A. M.; Shetty, V. A.; Gregor, C.; Graham, J. H.; Tusing, L.; Hirsch, A. G.; Hall, E.; Troiani, V.; Davis, M. P.; Bieler, D. L.; Romagnoli, K. M.; Kraus, C. K.; Piper, B. J.; Wright, E. A.

2026-03-02 addiction medicine
10.64898/2026.02.27.26347207 medRxiv
Introduction: Recreational and medical cannabis use (CU) information is often available within the electronic health record (EHR) in a format that is impractical for health care provider use. Transformation of free-text EHR documentation in notes to discrete elements is possible using natural language processing (NLP) and has the potential to characterize CU efficiently. The objective of this study was to develop an NLP algorithm to identify documentation of CU within EHR unstructured clinical notes.

Methods: We identified EHR notes with cannabis-related terminologies through a keyword search among all Geisinger patients with at least one encounter between 1/1/2013 and 6/30/2022. We trained four NLP models to classify notes into six categories based on time, context, and reliability of CU documentation identified through manual annotation. We compared the demographic characteristics of patients with positive classification for CU using the best-performing model to those of the overall population.

Results: Of the over 1.7 million eligible patients, 150,726 (8.6%) were flagged as cannabis users. The Bio-ClinicalBERT, a transformer-based NLP model, achieved close to human performance in classifying CU (weighted Precision=91.4, Recall=93.3, F-score=92.4). Cannabis users had higher BMI and were at least nine-fold more likely to use tobacco, alcohol, and illicit substances.

Conclusion: Our study evaluated the prevalence of CU documentation across the entire corpus of EHR notes data without population segmentation. The NLP methodologies used achieved performance close to that of human annotation and laid the foundation for identifying and classifying CU within unstructured data sources, with future applications in research and patient care.
Plain Language Summary: Marijuana, also known as cannabis, may impact the health of patients, yet it is not routinely captured in medical records, and when documented, it is often found in unstructured formats (e.g., progress notes) rather than in discrete fields. Incomplete and unstructured capture limits many functional capabilities within the EHR that enhance patient care (e.g., drug interactions, notifications) and hinders researchers in identifying patients routinely exposed to marijuana. The transformation of free-text documentation of cannabis use (CU) into discrete elements can be performed using natural language processing (NLP). The objective of this study was to develop an NLP model to identify CU in unstructured clinical notes in the EHR. We examined the EHRs of Geisinger patients in Pennsylvania over a 10-year period. Among 1.7 million patients, 9% were identified as cannabis users. One of the NLP models tested, Bio-ClinicalBERT, achieved the highest performance. Cannabis users had a higher BMI and were ten-fold more likely to be tobacco users, ten-fold more likely to use alcohol, and nine-fold more likely to use illicit substances. NLP can be used to better understand the risks and benefits of CU at a population level and may improve patient identification to assist clinical decision-making. Future CU epidemiological research should continue to explore other avenues to automate and improve CU documentation by leveraging rapidly evolving technologies, such as artificial intelligence-driven tools.
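
The abstract reports weighted precision, recall, and F-score for a multi-class note classifier. As a minimal sketch of what "weighted" usually means in that setting (per-class metrics averaged by class support), the following assumes the standard definition; the abstract does not state the exact formula the authors used. The function name `weighted_prf` and the three labels are hypothetical placeholders, not the study's six categories or its data.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 (assumed standard definition).

    Each class's metric is weighted by its share of the true labels.
    """
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    p_w = r_w = f_w = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = count / total
        p_w += weight * precision
        r_w += weight * recall
        f_w += weight * f1
    return p_w, r_w, f_w


# Hypothetical toy labels, for illustration only (not from the study).
y_true = ["current", "current", "past", "none", "none", "none"]
y_pred = ["current", "past", "past", "none", "none", "current"]
precision, recall, f_score = weighted_prf(y_true, y_pred)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

Under this definition the weighted F-score is an average of per-class F-scores, so it need not equal the harmonic mean of the weighted precision and weighted recall.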

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Frontiers in Psychiatry | 83 papers in training set | Top 0.1% | 14.2%
2 | Journal of Biomedical Informatics | 45 papers in training set | Top 0.1% | 10.0%
3 | PLOS Digital Health | 91 papers in training set | Top 0.2% | 10.0%
4 | Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.1% | 6.2%
5 | PLOS ONE | 4510 papers in training set | Top 32% | 4.8%
6 | JAMA Network Open | 127 papers in training set | Top 0.6% | 4.8%
7 | Journal of General Internal Medicine | 20 papers in training set | Top 0.1% | 4.8%
50% of probability mass above
8 | JAMIA Open | 37 papers in training set | Top 0.4% | 3.6%
9 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.8% | 3.5%
10 | International Journal of Medical Informatics | 25 papers in training set | Top 0.5% | 2.8%
11 | Pharmacoepidemiology and Drug Safety | 13 papers in training set | Top 0.1% | 2.7%
12 | npj Digital Medicine | 97 papers in training set | Top 1% | 2.7%
13 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 1% | 2.6%
14 | Drug and Alcohol Dependence | 37 papers in training set | Top 0.4% | 1.8%
15 | BioData Mining | 15 papers in training set | Top 0.3% | 1.7%
16 | Statistics in Medicine | 34 papers in training set | Top 0.2% | 1.6%
17 | Scientific Reports | 3102 papers in training set | Top 62% | 1.5%
18 | BMC Health Services Research | 42 papers in training set | Top 1% | 1.5%
19 | International Journal of Drug Policy | 11 papers in training set | Top 0.2% | 1.3%
20 | The Lancet Public Health | 20 papers in training set | Top 0.4% | 1.3%
21 | European Journal of Epidemiology | 40 papers in training set | Top 0.6% | 0.9%
22 | JMIRx Med | 31 papers in training set | Top 1% | 0.9%
23 | BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.8%
24 | JMIR Formative Research | 32 papers in training set | Top 2% | 0.7%
25 | Journal of Medical Internet Research | 85 papers in training set | Top 5% | 0.7%
26 | Genes | 126 papers in training set | Top 4% | 0.6%
27 | Psychiatry Research | 35 papers in training set | Top 2% | 0.6%