Audited large language model triage for systematic review screening in national clinical guideline production: validation and prospective deployment

Fagerberg, P.; Sallander, O.; Vikhe Patil, K.; Thunborg, C.; Lundstrom, L.; Berg, A.; Nyman, A.; Borg, N.; Linden, T.

2026-06-03 health informatics

10.64898/2026.06.02.26354724 medRxiv


Title and abstract screening limits the timeliness of systematic reviews used for clinical guidelines. We evaluated audited large language model (LLM) triage at Sweden's National Board of Health and Welfare. Ten LLMs from five model families were tested on 419 Cochrane reviews comprising 26,892 records, and the selected ensemble was externally validated on 133 reviews including 8,501 records matched to planned guideline topics. The same locked model pair was then used prospectively across 24 systematic reviews in two national guideline programmes. On the 419-review selection benchmark, the selected Gemini-3-flash plus GPT-5.1 ensemble achieved 98.0% (95% CI, 97.3-98.7) mean review-level sensitivity, while topic-matched validation yielded 96.7% sensitivity (95% CI, 93.7-98.9). Prospective deployment screened 74,679 records, placed 63,858 (85.5%) in the AI-excluded pool and reduced estimated first-pass screening effort from 415 to 34 person-days. Across 600 randomly sampled AI-excluded records from the migraine and dementia programmes, none was confirmed as a final false negative after post-unblinding adjudication; across the completed 680-record audit, all 38 final retained records had been AI flagged, whereas locked blinded human consensus missed seven. These findings support locked, audited LLM triage, with human oversight and programme-specific monitoring, for systematic reviews used in national guidelines.
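The deployment figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the zero-event bound assumes the 600 audited records behave like a simple random sample from one pool of AI-excluded records, which the abstract does not state in those terms.

```python
def share(part: int, whole: int) -> float:
    """Return part / whole as a fraction."""
    return part / whole


def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper confidence bound on an event rate when
    0 events are observed in n independent trials.

    Solves (1 - p) ** n = 1 - confidence for p. This is an assumption
    of this sketch, not a method reported in the abstract.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Figures quoted in the abstract.
screened = 74_679
ai_excluded = 63_858
effort_before_days = 415
effort_after_days = 34
audited_excluded = 600  # sampled AI-excluded records, 0 confirmed false negatives

excluded_share = share(ai_excluded, screened)
effort_reduction = 1 - effort_after_days / effort_before_days
fn_upper = zero_event_upper_bound(audited_excluded)

print(f"AI-excluded share:        {excluded_share:.1%}")
print(f"Effort reduction:         {effort_reduction:.1%}")
print(f"Upper bound on FN rate:   {fn_upper:.2%} (one-sided, 95%)")
```

The first figure reproduces the 85.5% stated in the abstract. The other two are derived quantities under the stated assumptions and should not be read as results reported by the authors.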

