
Structured retrieval closes the gap between low-cost and frontier clinical language models

Gorenshtein, A.; Sorka, M.; Omar, M.; Miron, K.; Hatav, A.; Barash, Y.; Klang, E.; Shelly, S.

Posted 2026-03-24 · neurology · medRxiv · DOI: 10.64898/2026.03.22.26349018

Most clinical large language model (LLM) benchmarks rely on clean, concise vignettes that do not reflect the noisy, long-form documentation typical of real clinical records. How LLM performance degrades under realistic chart conditions remains poorly characterised. Here we test whether structured retrieval workflows protect National Institutes of Health Stroke Scale (NIHSS) scoring accuracy under systematic context stress. Using 100 de-identified acute stroke cases and a fully crossed 4 × 4 × 3 × 3 condition matrix (144 conditions per case), we vary context acquisition method, document length, distractor load and critical-information position across four Gemini models (57,047 retained runs). Structured retrieval reduces mean absolute error (MAE) from 4.58 to 2.96 points relative to non-agentic baselines (mean gain 1.62 MAE points; 95% CI 1.57 to 1.67; 35% relative reduction), with consistent gains across all 36 stress combinations. Lower-cost models show disproportionately larger gains (2.76 versus 0.45 MAE points). Tool-retrieved pipelines outperform retrieval-augmented generation in 33 of 36 combinations. These findings indicate that retrieval architecture, rather than model scale alone, is a tractable lever for robust, equitable clinical LLM deployment.
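The abstract's headline effect sizes follow from simple arithmetic on the two reported MAE values; a minimal sketch, using only the numbers stated above (4.58 baseline, 2.96 with structured retrieval):

```python
# Sanity-check the reported effect sizes from the abstract.
baseline_mae = 4.58   # non-agentic baselines (from abstract)
retrieval_mae = 2.96  # structured retrieval (from abstract)

absolute_gain = baseline_mae - retrieval_mae
relative_reduction = absolute_gain / baseline_mae

print(f"absolute gain: {absolute_gain:.2f} MAE points")  # 1.62
print(f"relative reduction: {relative_reduction:.0%}")   # 35%
```

Both derived figures match the abstract's reported "mean gain 1.62 MAE points" and "35% relative reduction".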

Matching journals

The top 3 journals account for just over 50% of the predicted probability mass (53.8% cumulatively).

Rank  Journal                                                   Papers in training set  Percentile  Probability
1     npj Digital Medicine                                      97                      Top 0.1%    28.4%
2     Nature Medicine                                           117                     Top 0.1%    18.0%
3     Med                                                       38                      Top 0.1%    7.4%
      --- 50% of probability mass above this line ---
4     Journal of the American Medical Informatics Association   61                      Top 0.4%    7.0%
5     Proceedings of the National Academy of Sciences           2130                    Top 19%     3.7%
6     Scientific Reports                                        3102                    Top 34%     3.7%
7     Nature Computational Science                              50                      Top 0.2%    3.0%
8     The Lancet Digital Health                                 25                      Top 0.2%    2.7%
9     PLOS Digital Health                                       91                      Top 1%      1.7%
10    Nature Communications                                     4913                    Top 54%     1.4%
11    PLOS ONE                                                  4510                    Top 60%     1.3%
12    Nature Biomedical Engineering                             42                      Top 1%      1.3%
13    Genome Medicine                                           154                     Top 7%      0.9%
14    BMC Medicine                                              163                     Top 6%      0.9%
15    Artificial Intelligence in Medicine                       15                      Top 0.6%    0.8%
16    Nucleic Acids Research                                    1128                    Top 16%     0.8%
17    Nature Human Behaviour                                    85                      Top 4%      0.8%
18    Brain                                                     154                     Top 4%      0.8%
19    iScience                                                  1063                    Top 33%     0.7%
20    Nature Neuroscience                                       216                     Top 7%      0.7%
21    Nature Genetics                                           240                     Top 8%      0.7%
22    Cell Reports Medicine                                     140                     Top 10%     0.5%
23    Journal of Biomedical Informatics                         45                      Top 2%      0.5%
24    Nature                                                    575                     Top 17%     0.5%
25    eLife                                                     5422                    Top 63%     0.5%
26    Communications Medicine                                   85                      Top 2%      0.5%
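The 50% cutoff after rank 3 can be reproduced by accumulating the listed probabilities; a minimal sketch, with the probabilities copied verbatim from the ranking above:

```python
from itertools import accumulate

# Predicted probabilities (%) for the 26 ranked journals, from the list above.
probs = [28.4, 18.0, 7.4, 7.0, 3.7, 3.7, 3.0, 2.7, 1.7, 1.4,
         1.3, 1.3, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7,
         0.7, 0.5, 0.5, 0.5, 0.5, 0.5]

cumulative = list(accumulate(probs))
# First rank at which the cumulative mass reaches 50%:
cutoff = next(i + 1 for i, c in enumerate(cumulative) if c >= 50.0)
print(cutoff, round(cumulative[cutoff - 1], 1))  # 3 53.8
```

The top 2 journals reach only 46.4%, so the third entry (Med) is what pushes the cumulative mass past 50%, which is why the divider sits after rank 3.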