Natural Language Processing for Clinical Laboratory Data Repository Systems: Implementation and Evaluation for Respiratory Viruses

Dolatabadi, E.; Chen, B.; Buchan, S. A.; Austin, A. M.; Azimaee, M.; McGeer, A.; Mubareka, S.; Kwong, J. C.

2022-11-29 · health informatics · medRxiv · doi:10.1101/2022.11.28.22282767

Background: With the growing volume and complexity of laboratory repositories, it has become tedious to parse unstructured data into structured, tabulated formats for secondary uses such as decision support, quality assurance, and outcome analysis. However, advances in Natural Language Processing (NLP) have enabled efficient, automated extraction of clinically meaningful medical concepts from unstructured reports.

Objective: In this study, we aimed to determine the feasibility of using an NLP model for information extraction as an alternative to a time-consuming and operationally resource-intensive handcrafted rule-based tool. We therefore sought to develop and evaluate a deep learning-based NLP model to derive knowledge and extract information from text-based laboratory reports sourced from a provincial laboratory repository system.

Methods: The NLP model, a hierarchical multi-label classifier, was trained on a corpus of laboratory reports covering testing for 14 different respiratory viruses and viral subtypes. The corpus included 85,000 unique laboratory reports annotated by eight Subject Matter Experts (SMEs). The model's performance stability and variation were analyzed across fine-grained and coarse-grained classes, and its generalizability was evaluated both internally and externally on various test sets.

Results: The NLP model was trained several times with random initialization on the development corpus, and the results of the ten best-performing models are presented in this paper. Overall, the model performed well on internal, out-of-time (pre-COVID-19), and external (different laboratories) test sets, with micro-averaged F1 scores above 94% across all classes. Higher precision and recall scores with less variability were observed for the internal and pre-COVID-19 test sets. As expected, the model's performance varied across categories and virus types owing to the imbalanced nature of the corpus and the sample sizes per class. Because intrinsically fewer viruses were detected than were tested for, performance on the "detected" cases was noticeably lower (lowest F1 score: 57%).

Conclusions: We demonstrated that deep learning-based NLP models are promising solutions for information extraction from text-based laboratory reports. If integrated into laboratory information system repositories, these approaches enable scalable, timely, and practical access to high-quality, encoded laboratory data.
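The headline metric above is the micro-averaged F1 score, which pools true positives, false positives, and false negatives across every class before computing precision and recall — so frequent classes dominate, which matters for an imbalanced multi-label corpus like this one. A minimal sketch of that computation (the class names below are hypothetical, not taken from the paper's label set):

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1: pool TP/FP/FN across all samples and classes,
    then compute a single precision/recall/F1 from the pooled counts."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not true
        fn += len(truth - pred)   # true labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy multi-label example with made-up virus classes
truth = [{"influenza_A", "detected"}, {"RSV"}, {"influenza_B"}]
preds = [{"influenza_A", "detected"}, {"RSV", "detected"}, {"influenza_A"}]
print(round(micro_f1(truth, preds), 3))  # → 0.667
```

Because the pooled counts weight every label occurrence equally, a rare class such as "detected" can score poorly (the 57% F1 noted above) while the micro average across all classes still exceeds 94%.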

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | International Journal of Medical Informatics | 25 | Top 0.1% | 25.7% |
| 2 | BMC Medical Informatics and Decision Making | 39 | Top 0.1% | 14.2% |
| 3 | JAMIA Open | 37 | Top 0.1% | 8.3% |
| 4 | Journal of the American Medical Informatics Association | 61 | Top 0.5% | 6.3% |
| 5 | JMIR Medical Informatics | 17 | Top 0.2% | 4.8% |
| 6 | Journal of Medical Internet Research | 85 | Top 1% | 4.3% |
| 7 | Journal of Biomedical Informatics | 45 | Top 0.3% | 4.3% |
| 8 | Frontiers in Digital Health | 20 | Top 0.3% | 3.6% |
| 9 | Scientific Reports | 3102 | Top 46% | 2.6% |
| 10 | Artificial Intelligence in Medicine | 15 | Top 0.3% | 1.8% |
| 11 | BMJ Health & Care Informatics | 13 | Top 0.4% | 1.7% |
| 12 | Biology Methods and Protocols | 53 | Top 1.0% | 1.7% |
| 13 | Informatics in Medicine Unlocked | 21 | Top 0.5% | 1.6% |
| 14 | PLOS ONE | 4510 | Top 59% | 1.3% |
| 15 | JCO Clinical Cancer Informatics | 18 | Top 0.6% | 1.3% |
| 16 | BMC Bioinformatics | 383 | Top 5% | 1.2% |
| 17 | JMIR Public Health and Surveillance | 45 | Top 3% | 1.2% |
| 18 | Cureus | 67 | Top 5% | 0.7% |
| 19 | Journal of the American Heart Association | 119 | Top 4% | 0.7% |
| 20 | American Journal of Infection Control | 12 | Top 0.4% | 0.7% |
| 21 | Computers in Biology and Medicine | 120 | Top 5% | 0.7% |
| 22 | BMC Medical Research Methodology | 43 | Top 2% | 0.6% |