Audited large language model triage for systematic review screening in national clinical guideline production: validation and prospective deployment

Fagerberg, P.; Sallander, O.; Vikhe Patil, K.; Thunborg, C.; Lundstrom, L.; Berg, A.; Nyman, A.; Borg, N.; Linden, T.

2026-06-03 health informatics
10.64898/2026.06.02.26354724 medRxiv

Title and abstract screening limits the timeliness of systematic reviews used for clinical guidelines. We evaluated audited large language model (LLM) triage at Sweden's National Board of Health and Welfare. Ten LLMs from five model families were tested on 419 Cochrane reviews comprising 26,892 records, and the selected ensemble was externally validated on 133 reviews including 8,501 records matched to planned guideline topics. The same locked model pair was then used prospectively across 24 systematic reviews in two national guideline programmes. On the 419-review selection benchmark, the selected Gemini-3-flash plus GPT-5.1 ensemble achieved 98.0% (95% CI, 97.3-98.7) mean review-level sensitivity, while topic-matched validation yielded 96.7% sensitivity (95% CI, 93.7-98.9). Prospective deployment screened 74,679 records, placed 63,858 (85.5%) in the AI-excluded pool and reduced estimated first-pass screening effort from 415 to 34 person-days. Across 600 randomly sampled AI-excluded records from the migraine and dementia programmes, none was confirmed as a final false negative after post-unblinding adjudication; across the completed 680-record audit, all 38 final retained records had been AI flagged, whereas locked blinded human consensus missed seven. These findings support locked, audited LLM triage, with human oversight and programme-specific monitoring, for systematic reviews used in national guidelines.
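The deployment figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis: it recomputes the AI-excluded share and the relative reduction in first-pass effort from the reported counts, and adds an exact one-sided 95% upper bound on the false-negative rate implied by zero confirmed false negatives among 600 sampled records. That bound assumes simple random sampling from the AI-excluded pool, which is an assumption made here; the function name `zero_event_upper_bound` is hypothetical.

```python
# Counts reported in the abstract.
screened_records = 74_679
ai_excluded_records = 63_858
person_days_before = 415
person_days_after = 34
audit_sample_size = 600  # sampled AI-excluded records, 0 confirmed false negatives


def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate when 0 events occur in n trials.

    Solves (1 - p)^n = 1 - confidence for p. Assumes independent draws from a
    simple random sample (an assumption of this sketch, not of the paper).
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


# Share of screened records placed in the AI-excluded pool.
excluded_share = ai_excluded_records / screened_records

# Relative reduction in estimated first-pass screening effort.
effort_reduction = 1.0 - person_days_after / person_days_before

# Upper bound on the false-negative rate among AI-excluded records.
fn_rate_upper = zero_event_upper_bound(audit_sample_size)

print(f"AI-excluded share:       {excluded_share:.1%}")
print(f"Effort reduction:        {effort_reduction:.1%}")
print(f"FN rate 95% upper bound: {fn_rate_upper:.2%}")
```

Under these assumptions the bound is close to the familiar "rule of three" approximation, 3/600 = 0.5%, so an audit with zero observed misses still leaves room for a small residual false-negative rate in the excluded pool.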

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.2%; 14.6%
2. BMC Medicine: 163 papers in training set; Top 0.1%; 12.3%
3. Journal of Clinical Epidemiology: 28 papers in training set; Top 0.1%; 12.3%
4. Research Synthesis Methods: 20 papers in training set; Top 0.1%; 6.3%
5. PLOS ONE: 4510 papers in training set; Top 31%; 4.8%

50% of probability mass above

6. Nature Communications: 4913 papers in training set; Top 34%; 4.8%
7. npj Digital Medicine: 97 papers in training set; Top 1%; 4.3%
8. BMJ: 49 papers in training set; Top 0.2%; 3.9%
9. PLOS Biology: 408 papers in training set; Top 3%; 3.6%
10. Nature Human Behaviour: 85 papers in training set; Top 1%; 3.2%
11. The Lancet Digital Health: 25 papers in training set; Top 0.2%; 2.3%
12. Annals of Internal Medicine: 27 papers in training set; Top 0.4%; 1.7%
13. Trials: 25 papers in training set; Top 0.9%; 1.5%
14. JAMA: 17 papers in training set; Top 0.1%; 1.5%
15. Philosophical Transactions of the Royal Society B: 51 papers in training set; Top 3%; 1.5%
16. Science Translational Medicine: 111 papers in training set; Top 4%; 1.2%
17. PLOS Medicine: 98 papers in training set; Top 4%; 0.9%
18. European Respiratory Journal: 54 papers in training set; Top 2%; 0.9%
19. eLife: 5422 papers in training set; Top 54%; 0.9%
20. Scientific Reports: 3102 papers in training set; Top 73%; 0.8%
21. BMJ Open: 554 papers in training set; Top 13%; 0.7%
22. JAMA Network Open: 127 papers in training set; Top 5%; 0.7%
23. eClinicalMedicine: 55 papers in training set; Top 2%; 0.6%
24. The Lancet Infectious Diseases: 71 papers in training set; Top 3%; 0.6%
25. Clinical and Translational Science: 21 papers in training set; Top 1%; 0.6%
26. Journal of Medical Internet Research: 85 papers in training set; Top 5%; 0.6%