
Automated Derivation of Diagnostic Criteria for Lung Cancer using Natural Language Processing on Electronic Health Records: A pilot study.

Houston, A.; Williams, S.; Ricketts, W.; Gutteridge, C.; Tackaberry, C.; Conibear, J.

2024-02-21 health informatics
10.1101/2024.02.20.24303084 medRxiv

Background: The digitisation of healthcare records has generated vast amounts of unstructured data, presenting opportunities to improve disease diagnosis where clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach that uses natural language processing to extract clinical concepts from free text, which are then used to automatically form diagnostic criteria for lung cancer from unstructured secondary-care data.

Methods: Patients aged 40 and above who underwent a chest x-ray (CXR) between 2016 and 2022 were included. ICD-10 and unstructured data were pulled from their electronic health records (EHRs) for the 12 months preceding the CXR. The unstructured data were processed using named entity recognition to extract symptoms, which were mapped to SNOMED-CT codes. Subsumption of features up the SNOMED-CT hierarchy was used to mitigate feature sparsity, and a frequency-based criterion, combined with univariate logarithmic probabilities, was applied to select candidate features to take forward to the model development phase. A genetic algorithm was employed to identify the most discriminating features to form the diagnostic criteria.

Results: 75,002 patients were included, with 1,012 lung cancer diagnoses made within 12 months of the CXR. The best-performing model achieved an AUROC of 0.72. An existing disorder of the lung, such as pneumonia, and a cough increased the probability of a lung cancer diagnosis. Anomalies of great vessel, disorder of the retroperitoneal compartment and context-dependent findings, such as pain, statistically reduced the risk of lung cancer, making other diagnoses more likely. The developed model outperformed the existing cancer risk scores against which it was compared.

Conclusions: The proposed methods demonstrated success in leveraging unstructured secondary-care data to derive diagnostic criteria for lung cancer, outperforming existing risk tools. These advancements show potential for enhancing patient care and outcomes. However, it is essential to address specific limitations by integrating primary care data to ensure a more thorough and unbiased development of diagnostic criteria. Moreover, the study highlights the importance of contextualising SNOMED-CT concepts into meaningful terminology that resonates with clinicians, facilitating a clearer and more tangible understanding of the criteria applied.
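
As a rough illustration of the subsumption and frequency-based selection steps described in the Methods, the following Python sketch rolls each patient's extracted concepts up a small is-a hierarchy and keeps only features seen in enough patients. It is a minimal sketch under stated assumptions, not the authors' implementation: the `PARENTS` hierarchy, the concept names, the function names and the `min_patients` threshold are all hypothetical, and the univariate log-probability filter and genetic algorithm are not shown.

```python
from collections import Counter

# Hypothetical toy is-a hierarchy (child -> parents); not real SNOMED-CT content.
PARENTS = {
    "pneumonia": ["disorder of lung"],
    "copd": ["disorder of lung"],
    "disorder of lung": ["disorder of respiratory system"],
    "productive cough": ["cough"],
    "cough": ["respiratory finding"],
}


def ancestors(concept, parents):
    """Return every concept reachable by following is-a links upwards."""
    seen = set()
    stack = list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def subsume(concepts, parents):
    """Expand a patient's concepts with all of their ancestors."""
    expanded = set(concepts)
    for concept in concepts:
        expanded |= ancestors(concept, parents)
    return expanded


def frequent_features(patients, parents, min_patients):
    """Keep features present in at least min_patients patients after subsumption."""
    counts = Counter()
    for concepts in patients:
        # Each patient contributes at most one count per feature.
        counts.update(subsume(concepts, parents))
    return {feature for feature, n in counts.items() if n >= min_patients}


# Three hypothetical patients, each with a set of extracted concepts.
patients = [{"pneumonia"}, {"copd", "productive cough"}, {"cough"}]
selected = frequent_features(patients, PARENTS, min_patients=2)
```

In this toy data, no leaf concept other than "cough" occurs in two patients, but their shared ancestors do, which is the sense in which subsumption can reduce sparsity.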

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. JMIR Medical Informatics: 17 papers in training set; Top 0.1%; 22.0%
2. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 0.1%; 19.0%
3. International Journal of Medical Informatics: 25 papers in training set; Top 0.1%; 14.0%

50% of probability mass above

4. Journal of Biomedical Informatics: 45 papers in training set; Top 0.3%; 6.2%
5. JAMIA Open: 37 papers in training set; Top 0.4%; 3.6%
6. PLOS ONE: 4510 papers in training set; Top 41%; 3.5%
7. Journal of Medical Internet Research: 85 papers in training set; Top 2%; 3.0%
8. Scientific Reports: 3102 papers in training set; Top 43%; 2.8%
9. BMJ Health & Care Informatics: 13 papers in training set; Top 0.2%; 2.7%
10. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.3%; 1.7%
11. Journal of the American Medical Informatics Association: 61 papers in training set; Top 1%; 1.7%
12. BMC Medical Research Methodology: 43 papers in training set; Top 0.7%; 1.6%
13. Computers in Biology and Medicine: 120 papers in training set; Top 3%; 1.3%
14. Computer Methods and Programs in Biomedicine: 27 papers in training set; Top 0.5%; 1.3%
15. PLOS Digital Health: 91 papers in training set; Top 2%; 1.2%
16. Diagnostics: 48 papers in training set; Top 2%; 0.9%
17. Frontiers in Digital Health: 20 papers in training set; Top 1%; 0.9%
18. Informatics in Medicine Unlocked: 21 papers in training set; Top 1%; 0.8%
19. Journal of the American Heart Association: 119 papers in training set; Top 4%; 0.7%
20. npj Digital Medicine: 97 papers in training set; Top 4%; 0.7%
21. Biology Methods and Protocols: 53 papers in training set; Top 3%; 0.6%
22. JCO Clinical Cancer Informatics: 18 papers in training set; Top 1%; 0.6%