Evaluating a Locally Deployed 20-Billion Parameter Large Language Model for Automated Abstract Screening in Systematic Reviews
Moreira Melo, P. H.; Poenaru, D.; Guadagno, E.
Show abstract
BackgroundSystematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings. ObjectiveTo evaluate the performance of a locally deployed 20-billion parameter LLM for automated abstract screening in systematic reviews using a sensitivity-enhanced prompting strategy, with blind expert adjudication of all discordant human-AI cases. MethodsWe deployed GPT-OSS:20B locally using Ollama and evaluated its performance across three systematic reviews: AI applications in pediatric surgical pathology (n=3,350), LLM applications in electronic health records (n=4,326), and parental stress/caregiver burden in surgically treated children (n=8,970). A sensitivity-enhanced prompting strategy instructing the model to include abstracts when uncertain was employed. All discordant cases underwent blind expert adjudication. ResultsAcross 16,646 abstracts, the LLM demonstrated variable sensitivity after expert adjudication: 100% in SR1, 95.7% in SR2, and 85.7% in SR3. Expert adjudication identified 11 human screening errors across all reviews that the LLM had correctly classified. The LLM completed screening 4.7 times faster than human reviewers. ConclusionsA locally deployed LLM with sensitivity-enhanced prompting shows promising performance for systematic review abstract screening, particularly for technology-focused topics. Performance variability across domains suggests that screening accuracy depends partly on the objectivity of inclusion criteria. We recommend deploying LLMs as second screeners alongside human reviewers until performance is more fully validated across diverse domains.
Matching journals
The top 3 journals account for 50% of the predicted probability mass.