AI-Assisted Data Extraction with a Large Language Model: A Study Within Reviews

Gartlehner, G.; Kugley, S.; Crotty, K.; Viswanathan, M.; Dobrescu, A.; Nussbaumer-Streit, B.; Booth, G.; Treadwell, J.; Han, J. M.; Wagner, J.; Apaydin, E.; Coppola, E.; Maglione, M.; Hilscher, R.; Chew, R.; Pilar, M.; Swanton, B.; Kahwati, L.

medRxiv preprint · 2025-03-21 · Health Systems and Quality Improvement · DOI: 10.1101/2025.03.20.25324350
Background: Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.

Objective: To compare an AI-assisted data extraction process with a traditional, human-only process.

Design: Study within reviews (SWAR) using a prospective, parallel-group comparison with blinded data adjudicators.

Setting: Workflow validation within six ongoing systematic reviews of interventions under real-world conditions.

Intervention: Initial data extraction using an LLM (Claude versions 2.1, 3.0 Opus, and 3.5 Sonnet), verified by a human reviewer.

Measurements: Concordance, time on task, accuracy, recall, precision, and error analysis.

Results: The six systematic reviews of the SWAR contributed 9,341 data elements extracted from 63 studies. Concordance between the two methods was 77.2%. Compared with enhanced human data extraction, the accuracy of the AI-assisted approach was 91.0%, with a recall of 89.4% and a precision of 98.9%. The AI-assisted approach had fewer incorrect extractions (9.0% vs. 11.0%) and a similar risk of major errors (2.5% vs. 2.7%) compared with the traditional human-only method, with a median time saving of 41 minutes per study. Missed data items were the most frequent errors in both approaches.

Limitations: Assessing the concordance of data extractions and classifying errors required subjective judgment. Tracking time on task consistently was challenging.

Conclusion: The use of an LLM can improve the accuracy of data extraction and save time in evidence synthesis. The results reinforce previous findings that human-only data extraction is prone to errors.

Primary Funding Source: US Agency for Healthcare Research and Quality; RTI International.

Registration: SWAR28 Gerald Gartlehner (2023 FEB 11 2102)
