
Artificial Intelligence for Automated, Highly Accurate, and Scalable Multimodal EHR Data Abstraction

Margaritis, G.; Petridis, P.; Bertsimas, D.; Bloom, J.; Hagberg, R.; Habib, R.; Shahian, D. M.; Orfanoudaki, A.

2026-03-17 health informatics
10.64898/2026.03.16.26348522 medRxiv

Electronic health records (EHRs) contain rich multimodal data but remain underutilized for populating clinical registries because manual abstraction is slow and costly. We developed an AI-driven pipeline to automate data abstraction for variables in the Society of Thoracic Surgeons Adult Cardiac Surgery Database (ACSD). Models were developed using Mass General Brigham data and externally validated on Hartford HealthCare data. The pipeline processes ten clinical EHR sources (seven unstructured text types and three structured data types), each encoded three ways: with two language-model embeddings and with term frequency-inverse document frequency (TF-IDF). Ten sources times three encodings yielded 30 source-specific models per target variable, whose predictions were aggregated by an ensemble meta-learner. A dual-threshold confidence framework then enforced registry-grade accuracy standards and deferred uncertain predictions to human review. The pipeline achieved an overall accuracy exceeding 99% across 647 registry variables while automatically completing 49.5% and 43.2% of variables at the two sites, respectively. These results demonstrate that AI-assisted abstraction can substantially reduce the burden of clinical registry data collection while maintaining high accuracy.
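The deferral step described in the abstract can be illustrated with a minimal sketch. It assumes a binary target variable and a single meta-learner probability per case; the abstract does not give the thresholds or the decision rule, so the function names, the `low`/`high` values, and the symmetric rule below are hypothetical and are not the authors' implementation.

```python
from typing import List, Optional, Tuple


def dual_threshold_decision(
    p_positive: float, low: float = 0.02, high: float = 0.98
) -> Optional[int]:
    """Return 1 or 0 when the prediction is confident, or None to defer.

    `low` and `high` are illustrative values (assumptions), not thresholds
    reported in the source.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    if p_positive >= high:
        return 1  # confident positive: complete automatically
    if p_positive <= low:
        return 0  # confident negative: complete automatically
    return None  # uncertain: defer to human review


def summarize(
    probabilities: List[float], low: float = 0.02, high: float = 0.98
) -> Tuple[List[Optional[int]], float]:
    """Apply the rule to each probability and report the auto-completion rate."""
    decisions = [dual_threshold_decision(p, low, high) for p in probabilities]
    automated = sum(d is not None for d in decisions)
    rate = automated / len(decisions) if decisions else 0.0
    return decisions, rate


if __name__ == "__main__":
    # Hypothetical meta-learner outputs for four cases of one variable.
    decisions, rate = summarize([0.99, 0.50, 0.01, 0.70])
    print(decisions, rate)
```

Moving the two thresholds closer together completes more cases automatically at the cost of accuracy on the automated subset; moving them apart does the reverse, which is the trade-off a confidence framework of this kind has to balance.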

Matching journals

The top 2 journals together account for at least 50% of the predicted probability mass (49.7% + 7.1% = 56.8%).

1. npj Digital Medicine (97 papers in training set): Top 0.1%, 49.7%
2. Journal of the American Medical Informatics Association (61 papers in training set): Top 0.4%, 7.1%
50% of probability mass above
3. Journal of Biomedical Informatics (45 papers in training set): Top 0.3%, 4.3%
4. JMIR Medical Informatics (17 papers in training set): Top 0.2%, 4.1%
5. Scientific Reports (3102 papers in training set): Top 35%, 3.6%
6. BMC Medical Informatics and Decision Making (39 papers in training set): Top 1%, 2.6%
7. International Journal of Medical Informatics (25 papers in training set): Top 0.7%, 1.9%
8. JAMIA Open (37 papers in training set): Top 0.8%, 1.7%
9. JCO Clinical Cancer Informatics (18 papers in training set): Top 0.5%, 1.7%
10. Nature Communications (4913 papers in training set): Top 52%, 1.7%
11. Journal of Medical Internet Research (85 papers in training set): Top 3%, 1.2%
12. IEEE Journal of Biomedical and Health Informatics (34 papers in training set): Top 1%, 1.2%
13. PLOS Digital Health (91 papers in training set): Top 2%, 1.2%
14. Nature Biomedical Engineering (42 papers in training set): Top 1%, 1.2%
15. Nature Machine Intelligence (61 papers in training set): Top 3%, 1.1%
16. Nature Medicine (117 papers in training set): Top 4%, 0.9%
17. Med (38 papers in training set): Top 0.5%, 0.9%
18. Advanced Science (249 papers in training set): Top 17%, 0.9%
19. Patterns (70 papers in training set): Top 2%, 0.9%
20. Communications Biology (886 papers in training set): Top 24%, 0.7%
21. Artificial Intelligence in Medicine (15 papers in training set): Top 0.8%, 0.7%
22. Frontiers in Digital Health (20 papers in training set): Top 2%, 0.7%
23. PLOS ONE (4510 papers in training set): Top 71%, 0.6%
24. The Lancet Digital Health (25 papers in training set): Top 1%, 0.6%