
Using Natural Language Processing of Clinical Notes to Supplement Structured Electronic Health Record Data for Phenotyping Smoking and Obesity in a Healthcare System

Yang, J.; Gu, B.; Pillai, H.; Lii, J.; Cronkite, D.; Marsolo, K. A.; Desai, R. J.

2026-01-21 health informatics
10.64898/2026.01.18.26344356 medRxiv

Purpose: Studies based on electronic health records (EHR) often rely on structured data, which may incompletely capture important clinical phenotypes documented in EHR notes. The purpose of this study was to assess two natural language processing (NLP) tools for extracting phenotypes from unstructured EHR notes, and to evaluate the added value of integrating NLP-derived phenotypes with structured EHR data at health system scale.

Methods: This retrospective study is based on inpatient and outpatient EHR data from the Mass General Brigham healthcare system between January 1, 2019 and December 31, 2020. Two established rule-based NLP tools were applied to extract smoking and obesity information from 19,215,303 clinical notes of 503,025 patients. NLP performance was evaluated through manual review of stratified samples. Phenotype prevalence was estimated using structured EHR data alone and compared with prevalence estimates obtained by supplementing structured data with NLP-derived features.

Results: Both NLP tools performed well: accuracy and F1 score were both 0.99 for smoking, and 0.92 and 0.91, respectively, for obesity. The combination of NLP and structured data identified 220,714 patients (43.88%) with smoking, compared with 170,396 patients (33.87%) identified using structured data alone, representing a 29.5% relative increase. For obesity, NLP identified 121,360 patients (24.12%) from EHR notes, and 169,905 patients (33.78%) were documented in structured data; inclusion of NLP-derived features contributed an additional 32,823 patients, corresponding to a 19.3% relative increase.

Conclusion: NLP-derived phenotypes from unstructured EHR notes substantially improve patient identification for both smoking and obesity compared with structured EHR data alone at scale.
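The prevalence and relative-increase figures in the abstract can be reproduced from the reported counts. The following is a minimal sketch of that arithmetic only. The function and variable names are illustrative assumptions and do not come from the study, and it assumes the relative increase is the number of additionally identified patients divided by the structured-data-only count.

```python
# Counts reported in the abstract.
TOTAL_PATIENTS = 503_025
SMOKING_STRUCTURED = 170_396   # structured data alone
SMOKING_COMBINED = 220_714     # structured data plus NLP
OBESITY_STRUCTURED = 169_905   # structured data alone
OBESITY_ADDED_BY_NLP = 32_823  # additional patients contributed by NLP


def prevalence_pct(count: int, total: int) -> float:
    """Share of the cohort with the phenotype, as a percentage."""
    return 100.0 * count / total


def relative_increase_pct(baseline: int, added: int) -> float:
    """Additional patients as a percentage of the baseline count
    (assumed definition of 'relative increase')."""
    return 100.0 * added / baseline


smoking_added = SMOKING_COMBINED - SMOKING_STRUCTURED

smoking_structured_pct = round(prevalence_pct(SMOKING_STRUCTURED, TOTAL_PATIENTS), 2)
smoking_combined_pct = round(prevalence_pct(SMOKING_COMBINED, TOTAL_PATIENTS), 2)
smoking_rel_increase = round(relative_increase_pct(SMOKING_STRUCTURED, smoking_added), 1)

obesity_structured_pct = round(prevalence_pct(OBESITY_STRUCTURED, TOTAL_PATIENTS), 2)
obesity_rel_increase = round(relative_increase_pct(OBESITY_STRUCTURED, OBESITY_ADDED_BY_NLP), 1)

print(smoking_structured_pct, smoking_combined_pct, smoking_rel_increase)
print(obesity_structured_pct, obesity_rel_increase)
```

Under this definition the computed values agree with the percentages quoted in the Results.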

Matching journals

The top 3 journals account for 50% of the predicted probability mass.