Artificial Intelligence for Automated, Highly Accurate, and Scalable Multimodal EHR Data Abstraction
Margaritis, G.; Petridis, P.; Bertsimas, D.; Bloom, J.; Hagberg, R.; Habib, R.; Shahian, D. M.; Orfanoudaki, A.
Electronic health records (EHRs) contain rich multimodal data but remain underutilized for populating clinical registries because manual abstraction is slow and costly. We developed an AI-driven pipeline to automate data abstraction for variables in the Society of Thoracic Surgeons Adult Cardiac Surgery Database (ACSD). Models were developed using Mass General Brigham data and externally validated on Hartford HealthCare data. The pipeline processes ten clinical EHR sources (seven unstructured text types and three structured data types), each encoded using two language-model embeddings and term frequency-inverse document frequency (TF-IDF). This approach yielded 30 source-specific models per target variable (ten sources times three encodings), whose predictions were aggregated by an ensemble meta-learner. A dual-threshold confidence framework then enforced registry-grade accuracy standards and deferred uncertain predictions to human review. The pipeline achieved an overall accuracy exceeding 99% across 647 registry variables while automatically completing 49.5% and 43.2% of variables at the two sites, respectively. These results demonstrate that AI-assisted abstraction can substantially reduce the data collection burden of clinical registries while maintaining high accuracy.
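The structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the source and encoding names are hypothetical placeholders (the abstract gives only their counts), a plain average stands in for the ensemble meta-learner, the threshold values are illustrative, and the reading of "dual-threshold" as a lower and an upper cutoff on a predicted probability is an assumption.

```python
from itertools import product
from statistics import mean

# Hypothetical placeholder names; the abstract states only the counts
# (seven unstructured text types, three structured data types, three encodings).
TEXT_SOURCES = [f"text_{i}" for i in range(1, 8)]
STRUCTURED_SOURCES = [f"structured_{i}" for i in range(1, 4)]
ENCODINGS = ["lm_embedding_a", "lm_embedding_b", "tfidf"]


def model_grid():
    """Enumerate one (source, encoding) pair per source-specific model."""
    return list(product(TEXT_SOURCES + STRUCTURED_SOURCES, ENCODINGS))


def aggregate(probabilities):
    """Stand-in for the meta-learner: a plain average of model outputs."""
    return mean(probabilities)


def decide(p, lower=0.02, upper=0.98):
    """Assumed dual-threshold rule for a binary variable.

    Auto-complete only when the aggregated probability is confidently
    high or confidently low; otherwise defer to a human abstractor.
    The default thresholds are illustrative, not taken from the paper.
    """
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if p >= upper:
        return "auto_positive"
    if p <= lower:
        return "auto_negative"
    return "defer_to_human"


if __name__ == "__main__":
    print(len(model_grid()))              # number of models per target variable
    print(decide(aggregate([0.99, 1.0, 0.98])))
    print(decide(aggregate([0.40, 0.70, 0.55])))
```

Under this reading, widening the gap between the two thresholds raises the accuracy of automatically completed values at the cost of deferring more variables to human review, which is the trade-off the reported completion rates reflect.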
Matching journals
The top 2 journals account for 50% of the predicted probability mass.