Evaluating a Locally Deployed 20-Billion Parameter Large Language Model for Automated Abstract Screening in Systematic Reviews

Moreira Melo, P. H.; Poenaru, D.; Guadagno, E.

2026-03-04 · health informatics
medRxiv preprint · DOI: 10.64898/2026.03.04.26347506
Abstract

Background: Systematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings.

Objective: To evaluate the performance of a locally deployed 20-billion parameter LLM for automated abstract screening in systematic reviews using a sensitivity-enhanced prompting strategy, with blind expert adjudication of all discordant human-AI cases.

Methods: We deployed GPT-OSS:20B locally using Ollama and evaluated its performance across three systematic reviews: AI applications in pediatric surgical pathology (n=3,350), LLM applications in electronic health records (n=4,326), and parental stress/caregiver burden in surgically treated children (n=8,970). A sensitivity-enhanced prompting strategy instructing the model to include abstracts when uncertain was employed. All discordant cases underwent blind expert adjudication.

Results: Across 16,646 abstracts, the LLM demonstrated variable sensitivity after expert adjudication: 100% in SR1, 95.7% in SR2, and 85.7% in SR3. Expert adjudication identified 11 human screening errors across all reviews that the LLM had correctly classified. The LLM completed screening 4.7 times faster than human reviewers.

Conclusions: A locally deployed LLM with sensitivity-enhanced prompting shows promising performance for systematic review abstract screening, particularly for technology-focused topics. Performance variability across domains suggests that screening accuracy depends partly on the objectivity of inclusion criteria. We recommend deploying LLMs as second screeners alongside human reviewers until performance is more fully validated across diverse domains.
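The workflow the abstract describes, a local Ollama model screening abstracts with a prompt that defaults to inclusion under uncertainty, can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the prompt wording, the `parse_decision` fallback rule, and the endpoint URL are assumptions; `gpt-oss:20b` and the `/api/generate` endpoint are standard Ollama conventions.

```python
import json
import urllib.request

# Default Ollama REST endpoint for non-chat generation (assumed local deployment).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_screening_prompt(criteria: str, abstract: str) -> str:
    """Sensitivity-enhanced prompt: the model is told to INCLUDE when uncertain.

    The exact wording here is illustrative, not the prompt used in the study.
    """
    return (
        "You are screening abstracts for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n"
        "If you are uncertain whether the abstract meets the criteria, "
        "answer INCLUDE (err on the side of sensitivity).\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE.\n\n"
        f"Abstract:\n{abstract}"
    )


def parse_decision(reply: str) -> str:
    """Map a free-text model reply to a decision, defaulting to INCLUDE.

    Defaulting to INCLUDE on ambiguous output mirrors the sensitivity-enhanced
    strategy: false inclusions are cheaper than missed studies.
    """
    text = reply.strip().upper()
    if "EXCLUDE" in text and "INCLUDE" not in text:
        return "EXCLUDE"
    return "INCLUDE"


def screen_abstract(criteria: str, abstract: str, model: str = "gpt-oss:20b") -> str:
    """Send one abstract to the local model and return INCLUDE or EXCLUDE."""
    payload = json.dumps(
        {"model": model, "prompt": build_screening_prompt(criteria, abstract),
         "stream": False}
    ).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_decision(json.load(resp)["response"])
```

Running this over each record and logging disagreements with the human screener would reproduce the second-screener setup the conclusions recommend; discordant cases would then go to expert adjudication.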

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Journal percentile | Predicted probability |
|------|---------|------------------------|--------------------|-----------------------|
| 1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 42.0% |
| 2 | npj Digital Medicine | 97 | Top 0.7% | 6.7% |
| 3 | Journal of Biomedical Informatics | 45 | Top 0.4% | 3.8% |
| 4 | Journal of Medical Internet Research | 85 | Top 1% | 3.8% |
| 5 | Research Synthesis Methods | 20 | Top 0.1% | 2.9% |
| 6 | Nature Communications | 4913 | Top 43% | 2.8% |
| 7 | Journal of Clinical Epidemiology | 28 | Top 0.2% | 2.5% |
| 8 | JAMIA Open | 37 | Top 0.6% | 2.2% |
| 9 | PLOS ONE | 4510 | Top 52% | 1.8% |
| 10 | BMJ Health & Care Informatics | 13 | Top 0.4% | 1.8% |
| 11 | BMC Medicine | 163 | Top 3% | 1.8% |
| 12 | The Lancet Digital Health | 25 | Top 0.3% | 1.8% |
| 13 | JMIR Medical Informatics | 17 | Top 0.8% | 1.6% |
| 14 | JCO Clinical Cancer Informatics | 18 | Top 0.5% | 1.4% |
| 15 | Scientific Reports | 3102 | Top 65% | 1.3% |
| 16 | PLOS Biology | 408 | Top 13% | 1.3% |
| 17 | Annals of Internal Medicine | 27 | Top 0.7% | 0.9% |
| 18 | International Journal of Medical Informatics | 25 | Top 1% | 0.9% |
| 19 | BMC Medical Research Methodology | 43 | Top 1.0% | 0.9% |
| 20 | Philosophical Transactions of the Royal Society B | 51 | Top 5% | 0.8% |
| 21 | PLOS Digital Health | 91 | Top 3% | 0.8% |
| 22 | JAMA Network Open | 127 | Top 5% | 0.7% |
| 23 | Med | 38 | Top 1.0% | 0.7% |
| 24 | Nature Human Behaviour | 85 | Top 5% | 0.5% |
| 25 | Bioinformatics | 1061 | Top 10% | 0.5% |
| 26 | European Respiratory Journal | 54 | Top 2% | 0.5% |
| 27 | BMC Bioinformatics | 383 | Top 8% | 0.5% |
| 28 | Artificial Intelligence in Medicine | 15 | Top 0.9% | 0.5% |
| 29 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.5% |
| 30 | BMJ | 49 | Top 1% | 0.5% |