SmokeBERT: A BERT-based Model for Quantitative Smoking History Extraction from Clinical Narratives to Improve Lung Cancer Screening

Xue, Y.; Zhu, Y.; Zhuang, L.; Oh, Y.; Taira, R.; Aberle, D. R.; Prosper, A. E.; Hsu, W.; Lin, Y.

2025-06-20 health informatics

10.1101/2025.06.18.25329870 medRxiv


Tobacco use is a critical risk factor for diseases such as cancer and cardiovascular disorders. While electronic health records can capture categorical smoking status accurately, granular quantitative details, such as pack-years and years since quitting, are often embedded in clinical narratives. This information is crucial for assessing disease risk and determining eligibility for lung cancer screening (LCS). Existing natural language processing (NLP) tools excel at identifying smoking status but struggle to extract detailed quantitative data. To address this, we developed SmokeBERT, a fine-tuned BERT-based model optimized for extracting detailed smoking histories. In evaluations against a state-of-the-art rule-based NLP model, SmokeBERT achieved a higher F1 score (0.97 vs. 0.88 on the hold-out test set) and better identification of LCS-eligible patients (e.g., 98% vs. 60% for ≥20 pack-years). Future work includes creating a multilingual, language-agnostic version of SmokeBERT by incorporating datasets in multiple languages, exploring ensemble methods, and testing on larger datasets.
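To make the quantitative fields concrete, the following minimal sketch shows how extracted values could feed a screening-eligibility check. It is an illustration only, not the SmokeBERT pipeline: the function names are hypothetical, the pack-years formula is the standard definition (packs per day times years smoked), the 20 pack-year threshold comes from the abstract, and the 15-year quit window is an assumption based on the 2021 USPSTF recommendation. Age criteria are left out.

```python
def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Standard definition: packs smoked per day times years of smoking."""
    return packs_per_day * years_smoked


def meets_smoking_criteria(
    pack_years_value: float,
    currently_smokes: bool,
    years_since_quitting: float = 0.0,
    min_pack_years: float = 20.0,            # threshold cited in the abstract
    max_years_since_quitting: float = 15.0,  # assumed quit window, not from the source
) -> bool:
    """Check only the smoking-history part of LCS eligibility (hypothetical helper)."""
    if pack_years_value < min_pack_years:
        return False
    if currently_smokes:
        return True
    return years_since_quitting <= max_years_since_quitting


# Example: 1.5 packs per day for 20 years, quit 10 years ago.
exposure = pack_years(1.5, 20)
eligible = meets_smoking_criteria(exposure, currently_smokes=False, years_since_quitting=10)
```

Keeping the thresholds as parameters lets the same check be rerun under different screening guidelines without touching the extraction step.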

