Automation of Systematic Reviews with Large Language Models
Cao, C.; Arora, R.; Cento, P.; Manta, K.; Farahani, E.; Cecere, M.; Selemon, A.; Sang, J.; Gong, L. X.; Kloosterman, R.; Jiang, S.; Saleh, R.; Margalik, D.; Lin, J.; Jomy, J.; Xie, J.; Chen, D.; Gorla, J.; Lee, S.; Zhang, K.; Ware, H.; Whelan, M. G.; Teja, B.; Leung, A. A.; Ghosn, L.; Arora, R. K.; Noetel, M.; Emerson, D. B.; Boutron, I.; Moher, D.; Church, G. M.; Bobrovitz, N.
Structured Abstract

Importance: Systematic reviews (SRs) inform evidence-based decision making. Yet many take over a year to complete, are labor intensive, prone to human error, and face reproducibility challenges, limiting access to timely and reliable information.

Objective: To validate a large language model (LLM)-based workflow (otto-SR) for automating three of the most labor-intensive tasks in performing SRs (article screening, data extraction, and risk of bias assessment), and to assess its feasibility in rapidly updating existing reviews.

Design, Setting, and Participants: We conducted a validation study in four phases, with direct benchmarking against graduate-level human researchers in phases 1 and 2. Phase 1: article screening performance was measured across 32,357 citations from 5 systematic reviews; the reference standard was the original reviews' screening decisions after full-text screening. Phase 2: data extraction performance was measured across 4,495 data points from 495 studies in 7 reviews. Phase 3: risk of bias assessment (RoB 2, Newcastle-Ottawa, QUADAS-2) performance was measured across 345 studies from 12 reviews. Reference standards for phases 2 and 3 were created after blinded adjudication of the original review extractions and risk of bias assessments. Phase 4: otto-SR was used to reproduce and update the primary analyses from an issue of Cochrane reviews (n=12 reviews, 146,276 citations), with analytical comparisons to the original meta-analyzed findings. All discrepancies underwent dual human review.

Results: otto-SR showed high performance in phase 1 article screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and phase 2 data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). In phase 3, otto-SR demonstrated high interrater reliability for risk of bias judgements (Gwet's AC2: RoB 2, 0.98; Newcastle-Ottawa, 0.95; QUADAS-2, 0.74). In phase 4, otto-SR reproduced and updated the primary analysis from an issue of Cochrane reviews. Across these reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25) and found nearly twice as many eligible studies as the original authors (n=114 vs. 64). Meta-analyses based on otto-SR-generated screening and extraction outputs, subsequently verified through dual human review, yielded newly statistically significant effect estimates in 2 reviews and negated statistical significance in 1 review.

Conclusions and Relevance: LLMs achieve high performance in article screening, data extraction, and risk of bias assessment. They can rapidly reproduce and update existing systematic reviews, laying the foundation for automated, scalable, and reliable evidence synthesis.