
Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson's Disease

Shechter, Y.; Klevor, R.; Kouchache, T.; Bouhadoun, S.; Postuma, R. B.

2026-05-20 neurology
10.64898/2026.05.13.26353021 medRxiv

Background: The clinical applicability of large language models (LLMs) in Parkinson's disease (PD) management remains insufficiently characterized, particularly for generative responses to clinical vignettes. Objective: To evaluate the quality of clinical assessments and management plans generated by a general-purpose LLM (Gemini 1.5 Pro) and a medically specialized LLM (OpenEvidence), and to compare their performance. Methods: Each model generated free-text responses to 45 open-ended clinical queries covering assessment of the situation and the recommended management plan. Two movement disorders fellows rated the outputs on 5-point Likert scales, which were dichotomized into clinically appropriate (≥4) versus inappropriate (≤3). Discrepancies were adjudicated by a senior movement disorders specialist. Paired comparisons used McNemar's test; a qualitative analysis examined severe errors. Results: Gemini 1.5 Pro and OpenEvidence showed high rates of clinically appropriate assessments (80.0% vs. 86.7%) but lower rates for management plans (48.9% vs. 57.8%). Both assessment and plan were clinically appropriate in 46.7% and 55.6% of cases, respectively. None of these differences reached statistical significance. Severe errors were uncommon in assessments (6.7% vs. 8.9%) but more frequent in plans (26.7% for both models), predominantly reflecting treatment strategy errors. Conclusions: In generative clinical reasoning tasks based on Parkinson's disease management vignettes, LLMs demonstrated reasonable performance in assessment but consistent limitations in plan generation. The medically specialized LLM demonstrated several qualitative advantages but no statistically significant performance benefit over the general-purpose model. These tools should therefore be used with appropriate caution in Parkinson's disease management, particularly regarding treatment recommendations.
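The rating pipeline described in the Methods (dichotomize Likert scores at ≥4, then compare the two models on the same cases with McNemar's test) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say which variant of McNemar's test was used, so the exact binomial form is an assumption, and the function names and example ratings are hypothetical rather than study data.

```python
from math import comb


def dichotomize(score: int) -> bool:
    """Return True when a 5-point Likert score counts as clinically appropriate (>= 4)."""
    return score >= 4


def discordant_counts(scores_a, scores_b):
    """Count paired cases where exactly one model was rated appropriate.

    Returns (b, c): b = only model A appropriate, c = only model B appropriate.
    """
    b = c = 0
    for score_a, score_b in zip(scores_a, scores_b):
        a_ok, b_ok = dichotomize(score_a), dichotomize(score_b)
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (assumed variant) from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical adjudicated ratings for eight vignettes (NOT data from the study).
model_a = [5, 4, 3, 2, 4, 5, 3, 4]
model_b = [4, 4, 4, 4, 5, 3, 3, 5]

b, c = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(b, c)
print(f"discordant pairs: b={b}, c={c}; exact McNemar p={p_value:.3f}")
```

Only the discordant pairs enter the test, so two models can differ by several percentage points in their appropriate-response rates and still show no significant difference when few of the 45 cases are rated differently.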

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

Rank. Journal | Papers in training set | Top % | Predicted probability

1. npj Digital Medicine | 97 papers in training set | Top 0.5% | 10.3%
2. Parkinsonism & Related Disorders | 21 papers in training set | Top 0.1% | 8.5%
3. PLOS ONE | 4510 papers in training set | Top 26% | 6.5%
4. BMC Neurology | 12 papers in training set | Top 0.1% | 4.9%
5. Journal of NeuroEngineering and Rehabilitation | 28 papers in training set | Top 0.2% | 4.9%
6. Scientific Reports | 3102 papers in training set | Top 22% | 4.9%
7. Frontiers in Digital Health | 20 papers in training set | Top 0.2% | 3.6%
8. Frontiers in Neurology | 91 papers in training set | Top 2% | 3.6%
9. Movement Disorders | 62 papers in training set | Top 0.6% | 1.8%
10. Journal of the Neurological Sciences | 17 papers in training set | Top 0.2% | 1.8%

50% of probability mass above

11. Journal of Neurology | 26 papers in training set | Top 0.6% | 1.7%
12. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.7%
13. npj Parkinson's Disease | 89 papers in training set | Top 0.7% | 1.7%
14. Journal of Parkinson's Disease | 13 papers in training set | Top 0.2% | 1.7%
15. Journal of Neurology, Neurosurgery & Psychiatry | 29 papers in training set | Top 0.7% | 1.7%
16. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.4% | 1.5%
17. Orphanet Journal of Rare Diseases | 18 papers in training set | Top 0.4% | 1.4%
18. BMJ Health & Care Informatics | 13 papers in training set | Top 0.5% | 1.4%
19. Neurology Genetics | 14 papers in training set | Top 0.1% | 1.4%
20. PeerJ | 261 papers in training set | Top 9% | 1.4%
21. Brain Communications | 147 papers in training set | Top 2% | 1.2%
22. PLOS Digital Health | 91 papers in training set | Top 2% | 1.2%
23. Cortex | 102 papers in training set | Top 0.4% | 1.0%
24. Annals of Clinical and Translational Neurology | 29 papers in training set | Top 0.9% | 0.9%
25. Computers in Biology and Medicine | 120 papers in training set | Top 4% | 0.9%
26. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 0.9%
27. Journal of Cognitive Neuroscience | 119 papers in training set | Top 1% | 0.8%
28. JMIR Formative Research | 32 papers in training set | Top 1% | 0.8%
29. BMJ Open | 554 papers in training set | Top 12% | 0.8%
30. Brain Sciences | 52 papers in training set | Top 2% | 0.7%