
Large Language Model Performance in UK Advice & Guidance: A Pilot Study in Neurology

Healy, J.; Marvasti, A.; Wallace, D.; Baheerathan, A.; Ghosh, A.; Kossoff, J.; Thio, S.; Balaratnam, M.; Haider, S.; Ellershaw, S.; Dobson, R.

2026-05-18 neurology
10.64898/2026.05.13.26353081 medRxiv

Background: Large language models (LLMs) demonstrate strong performance in controlled medical environments such as multiple-choice exams, but their utility in real-world clinical workflows remains unproven. The NHS Advice & Guidance (A&G) service, where primary care clinicians can submit text-based queries to specialists, provides an environment for evaluating the clinical performance of LLMs in a specialist role.

Methods: We compared responses from MedGemma 4B-IT, an open-weight model deployed locally on hospital infrastructure, against specialist neurologist responses across 50 adult neurology A&G cases from University College London Hospital. Two neurologists and two GPs rated 80 blinded and 20 unblinded responses for outcome, safety, efficacy, and feasibility using standardised criteria; outcome was binary (correct/incorrect), while the other domains were scored 1-5. Inter-rater reliability was assessed using intraclass correlation coefficients (ICC).

Results: There were no statistically significant differences between blinded specialist neurologist and LLM responses in any domain (outcome: 84% vs 82%, p=0.67; safety: 3.98 vs 4.02, p=0.85; efficacy: 4.06 vs 3.98, p=0.61; feasibility: 4.39 vs 4.20, p=0.45). However, 10% of LLM responses received concerning scores (average score ≤2) compared to 0% of human responses, indicating potentially clinically important tail risk. Furthermore, unblinded ratings favoured human responses across all domains. Only 51% of binary outcomes had unanimous agreement, and inter-rater agreement was moderate across the other domains (ICC 0.50-0.52).

Conclusions: In this pilot study, aggregate scores for blinded human and LLM responses were similar, and no statistically significant differences were detected in this exploratory sample. However, aggregate metrics masked clinically important edge-case failures in LLM responses. Pronounced inter-rater variability and the potential impact of LLM/human syntax on blinded rater judgements highlight the challenges in establishing robust evaluation frameworks for clinical LLM deployment.

Matching journals

The top 11 journals account for 50% of the predicted probability mass.

1. PLOS ONE | 4510 papers in training set | Top 21% | 8.6%
2. npj Digital Medicine | 97 papers in training set | Top 0.7% | 7.3%
3. Frontiers in Neurology | 91 papers in training set | Top 0.9% | 6.5%
4. The Lancet Digital Health | 25 papers in training set | Top 0.1% | 4.3%
5. Scientific Reports | 3102 papers in training set | Top 30% | 4.1%
6. Emergency Medicine Journal | 20 papers in training set | Top 0.2% | 3.8%
7. BMJ Health & Care Informatics | 13 papers in training set | Top 0.1% | 3.7%
8. Journal of NeuroEngineering and Rehabilitation | 28 papers in training set | Top 0.3% | 3.7%
9. BMJ Open | 554 papers in training set | Top 6% | 3.1%
10. BMC Neurology | 12 papers in training set | Top 0.2% | 3.1%
11. Frontiers in Digital Health | 20 papers in training set | Top 0.3% | 2.8%

50% of probability mass above

12. Journal of Neurology, Neurosurgery & Psychiatry | 29 papers in training set | Top 0.4% | 2.8%
13. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.9% | 2.7%
14. Annals of Clinical and Translational Neurology | 29 papers in training set | Top 0.4% | 2.1%
15. BMC Medicine | 163 papers in training set | Top 3% | 1.9%
16. Journal of the Neurological Sciences | 17 papers in training set | Top 0.2% | 1.7%
17. PLOS Digital Health | 91 papers in training set | Top 1% | 1.7%
18. EClinicalMedicine | 21 papers in training set | Top 0.3% | 1.5%
19. Brain Communications | 147 papers in training set | Top 2% | 1.4%
20. Epilepsia | 49 papers in training set | Top 0.6% | 1.1%
21. iScience | 1063 papers in training set | Top 23% | 1.1%
22. Nature Medicine | 117 papers in training set | Top 4% | 1.0%
23. British Journal of General Practice | 22 papers in training set | Top 0.4% | 1.0%
24. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 0.9%
25. eBioMedicine | 130 papers in training set | Top 3% | 0.9%
26. Journal of Neurology | 26 papers in training set | Top 1% | 0.9%
27. Epilepsy Research | 12 papers in training set | Top 0.3% | 0.8%
28. Healthcare | 16 papers in training set | Top 2% | 0.8%
29. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 2% | 0.8%
30. GigaScience | 172 papers in training set | Top 3% | 0.8%
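
The statement that the top 11 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The following is a minimal sketch, not the site's own method: the function name `ranks_to_reach` is hypothetical, and the inputs are the displayed probabilities for ranks 1 to 15, which are rounded to one decimal place, so the running totals are approximate.

```python
# Predicted probabilities (percent) for ranks 1-15, copied from the list above.
# The displayed values are rounded, so cumulative sums are approximate.
probs = [8.6, 7.3, 6.5, 4.3, 4.1, 3.8, 3.7, 3.7, 3.1, 3.1, 2.8, 2.8, 2.7, 2.1, 1.9]


def ranks_to_reach(probabilities, threshold):
    """Return (rank, running total) for the first rank whose cumulative
    probability reaches the threshold, or None if it is never reached."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, round(total, 1)
    return None


cutoff = ranks_to_reach(probs, 50.0)
print(cutoff)
```

With these rounded inputs, the running total after rank 10 is about 48.2% and first passes 50% at rank 11, which agrees with where the list places its "50% of probability mass above" marker.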