SmokeBERT: A BERT-based Model for Quantitative Smoking History Extraction from Clinical Narratives to Improve Lung Cancer Screening

Xue, Y.; Zhu, Y.; Zhuang, L.; Oh, Y.; Taira, R.; Aberle, D. R.; Prosper, A. E.; Hsu, W.; Lin, Y.

2025-06-20 health informatics
10.1101/2025.06.18.25329870 medRxiv

Tobacco use is a critical risk factor for diseases such as cancer and cardiovascular disorders. While electronic health records can capture categorical smoking statuses accurately, granular quantitative details, such as pack years and years since quitting, are often embedded in clinical narratives. This information is crucial for assessing disease risk and determining eligibility for lung cancer screening (LCS). Existing natural language processing (NLP) tools excel at identifying smoking statuses but struggle to extract detailed quantitative data. To address this, we developed SmokeBERT, a fine-tuned BERT-based model optimized for extracting detailed smoking histories. Evaluations against a state-of-the-art rule-based NLP model demonstrated SmokeBERT's superior performance in F1 score (0.97 vs. 0.88 on the hold-out test set) and in identifying LCS-eligible patients (e.g., 98% vs. 60% for ≥20 pack years). Future work includes creating a multilingual, language-agnostic version of SmokeBERT by incorporating datasets in multiple languages, exploring ensemble methods, and testing on larger datasets.

Matching journals

The top 5 journals account for just over 50% of the predicted probability mass (about 52.7% in total).

1 | Journal of Biomedical Informatics | 45 papers in training set | Top 0.1% | 33.9%
2 | International Journal of Medical Informatics | 25 papers in training set | Top 0.1% | 6.6%
3 | Scientific Reports | 3102 papers in training set | Top 26% | 4.4%
4 | npj Digital Medicine | 97 papers in training set | Top 1% | 4.1%
5 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.7% | 3.7%
50% of probability mass above
6 | Bioinformatics | 1061 papers in training set | Top 6% | 3.2%
7 | JAMIA Open | 37 papers in training set | Top 0.5% | 3.0%
8 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.9% | 2.7%
9 | Frontiers in Digital Health | 20 papers in training set | Top 0.4% | 2.4%
10 | Artificial Intelligence in Medicine | 15 papers in training set | Top 0.2% | 2.1%
11 | PLOS ONE | 4510 papers in training set | Top 47% | 2.1%
12 | BMC Bioinformatics | 383 papers in training set | Top 4% | 1.7%
13 | Computers in Biology and Medicine | 120 papers in training set | Top 2% | 1.7%
14 | IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 1.0% | 1.7%
15 | Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.5%
16 | JMIR Medical Informatics | 17 papers in training set | Top 1.0% | 1.3%
17 | Nature Communications | 4913 papers in training set | Top 56% | 1.3%
18 | Patterns | 70 papers in training set | Top 1% | 1.3%
19 | Journal of Personalized Medicine | 28 papers in training set | Top 0.6% | 1.3%
20 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.6% | 1.1%
21 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.0%
22 | Med | 38 papers in training set | Top 0.6% | 0.9%
23 | PLOS Digital Health | 91 papers in training set | Top 2% | 0.8%
24 | iScience | 1063 papers in training set | Top 30% | 0.8%
25 | GigaScience | 172 papers in training set | Top 3% | 0.8%
26 | Healthcare | 16 papers in training set | Top 2% | 0.8%
27 | BioData Mining | 15 papers in training set | Top 0.8% | 0.8%
28 | Research Synthesis Methods | 20 papers in training set | Top 0.2% | 0.7%
29 | Informatics in Medicine Unlocked | 21 papers in training set | Top 2% | 0.5%