Approach to Machine Learning for Extraction of Real-World Data Variables from Electronic Health Records

Adamson, B. J.; Waskom, M.; Blarre, A.; Kelly, J.; Krismer, K.; Nemeth, S.; Gipetti, J.; Ritten, J.; Harrison, K.; Ho, G.; Linzmayer, R.; Bansal, T.; Wilkinson, S.; Amster, G.; Estola, E.; Benedum, C. M.; Fidyk, E.; Estevez, M.; Shapiro, W.; Cohen, A. B.

2023-03-06 oncology
10.1101/2023.03.02.23286522 medRxiv

Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI's ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability.

Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (eg, clinician notes, radiology reports, lab reports) to output a set of structured variables required for RWD analysis. This research used a nationwide EHR-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely by an ML model (ie, not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information.

Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manual abstraction. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates.

Conclusions: NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.1%, 14.1%
2. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.1%, 12.1%
3. PLOS ONE: 4510 papers in training set, Top 28%, 6.3%
4. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.5%, 6.3%
5. Scientific Reports: 3102 papers in training set, Top 20%, 6.2%
6. Biology Methods and Protocols: 53 papers in training set, Top 0.1%, 4.8%
7. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.6%, 4.8%

50% of probability mass above

8. International Journal of Medical Informatics: 25 papers in training set, Top 0.4%, 3.5%
9. Database: 51 papers in training set, Top 0.2%, 3.0%
10. Frontiers in Oncology: 95 papers in training set, Top 1%, 2.8%
11. BMC Bioinformatics: 383 papers in training set, Top 3%, 2.7%
12. PeerJ: 261 papers in training set, Top 4%, 2.3%
13. JAMIA Open: 37 papers in training set, Top 0.7%, 1.9%
14. BMC Research Notes: 29 papers in training set, Top 0.1%, 1.7%
15. iScience: 1063 papers in training set, Top 16%, 1.7%
16. PLOS Computational Biology: 1633 papers in training set, Top 17%, 1.7%
17. JMIR Medical Informatics: 17 papers in training set, Top 0.8%, 1.6%
18. npj Digital Medicine: 97 papers in training set, Top 2%, 1.5%
19. Cancer Medicine: 24 papers in training set, Top 0.9%, 1.3%
20. Journal of Medical Internet Research: 85 papers in training set, Top 3%, 1.2%
21. Journal of Translational Medicine: 46 papers in training set, Top 2%, 0.9%
22. Computers in Biology and Medicine: 120 papers in training set, Top 4%, 0.9%
23. Cureus: 67 papers in training set, Top 5%, 0.7%
24. BMC Infectious Diseases: 118 papers in training set, Top 5%, 0.7%
25. Frontiers in Bioinformatics: 45 papers in training set, Top 1.0%, 0.7%
26. Journal of Biomedical Informatics: 45 papers in training set, Top 2%, 0.7%
27. BMC Cancer: 52 papers in training set, Top 3%, 0.7%
28. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.6%
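
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The following is a minimal sketch; the function name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages shown in the list, so the cumulative total is approximate.

```python
# Predicted probabilities (%) for ranks 1-10, copied from the ranked list.
# These are the rounded values shown on the page, so sums are approximate.
probs = [14.1, 12.1, 6.3, 6.3, 6.2, 4.8, 4.8, 3.5, 3.0, 2.8]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative) for the shortest ranked prefix whose
    summed probability reaches the threshold (hypothetical helper)."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    # Threshold not reached with the values supplied.
    return None, total


count, cumulative = journals_to_reach(probs)
print(count, round(cumulative, 1))
```

With these values the first six journals sum to roughly 49.8%, just short of the threshold, and adding the seventh brings the total to roughly 54.6%, which agrees with the cut-off after rank 7.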