Approach to Machine Learning for Extraction of Real-World Data Variables from Electronic Health Records

Adamson, B. J.; Waskom, M.; Blarre, A.; Kelly, J.; Krismer, K.; Nemeth, S.; Gipetti, J.; Ritten, J.; Harrison, K.; Ho, G.; Linzmayer, R.; Bansal, T.; Wilkinson, S.; Amster, G.; Estola, E.; Benedum, C. M.; Fidyk, E.; Estevez, M.; Shapiro, W.; Cohen, A. B.

2023-03-06 oncology
10.1101/2023.03.02.23286522
Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI's ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHRs) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability.

Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (eg, clinician notes, radiology reports, lab reports) to output a set of structured variables required for RWD analysis. This research used a nationwide EHR-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely by an ML model (ie, not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information.

Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manual abstraction. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates.

Conclusions: NLP and ML enable the extraction of retrospective clinical data in EHRs with speed and scalability to help researchers learn from the experience of every person with cancer.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics, based on 14 papers, Top 0.1%, 18.3%
2. PLOS ONE, based on 1737 papers, Top 26%, 16.6%
3. Scientific Reports, based on 701 papers, Top 26%, 6.9%
4. Journal of the American Medical Informatics Association, based on 53 papers, Top 2%, 5.7%
5. Cancer Medicine, based on 17 papers, Top 0.9%, 3.2%

50% of probability mass above

6. npj Precision Oncology, based on 14 papers, Top 0.7%, 3.0%
7. International Journal of Medical Informatics, based on 25 papers, Top 1%, 3.0%
8. PeerJ, based on 46 papers, Top 1%, 3.0%
9. Cancers, based on 57 papers, Top 4%, 2.6%
10. Computers in Biology and Medicine, based on 39 papers, Top 2%, 2.6%
11. Diagnostics, based on 36 papers, Top 3%, 1.7%
12. International Journal of Radiation Oncology*Biology*Physics, based on 13 papers, Top 2%, 1.4%
13. Frontiers in Oncology, based on 34 papers, Top 4%, 1.4%
14. JCO Precision Oncology, based on 11 papers, Top 2%, 1.4%
15. JMIR Formative Research, based on 31 papers, Top 3%, 1.4%
16. BMJ Health & Care Informatics, based on 13 papers, Top 2%, 1.4%
17. Journal of Medical Internet Research, based on 81 papers, Top 10%, 1.4%
18. Cureus, based on 64 papers, Top 11%, 1.4%
19. Biology Methods and Protocols, based on 19 papers, Top 2%, 0.9%
20. PLOS Computational Biology, based on 141 papers, Top 9%, 0.9%
21. BMC Medical Informatics and Decision Making, based on 36 papers, Top 6%, 0.9%
22. JMIR Medical Informatics, based on 16 papers, Top 4%, 0.9%
23. JMIR Research Protocols, based on 18 papers, Top 4%, 0.7%