
Diagnostic Accuracy and Clinical Reasoning of Multiple Large Language Models in Psychiatry

Jin, K. W.; Rostam-Abadi, Y.; Chaudhary, P.; Garrett, M. A.; Huang, A. S.; Montelongo, M.; Nagpal, C.; Shei, J.; Weathers, J.; Zhang, J. S.; Chen, Q.; Kim, J.; Malgaroli, M.; Mathis, W. S.; Rodriguez, C. I.; Selek, S.; Sharma, M. S.; Pittenger, C.; Yip, S. W.; Zaboski, B. A.; Xu, H.

2026-02-09 psychiatry and clinical psychology
10.64898/2026.02.03.26345402 medRxiv

Importance: Large language models (LLMs) have demonstrated diagnostic potential in several medical specialties, but their application to psychiatry, where diagnosis relies heavily on clinical judgment, narrative interpretation, and reasoning under uncertainty, remains insufficiently evaluated.

Objective: To evaluate diagnostic accuracy and clinician-judged reasoning quality of multiple large language models using psychiatric case vignettes.

Design: Mixed-methods evaluation study of diagnostic accuracy across four LLMs using 196 psychiatric case vignettes (135 published and 61 novel). Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes using structured clinician ratings along two reasoning dimensions. The highest-performing model was illustratively compared with psychiatry trainees on the same subset. Diagnostic correctness for the full vignette set was assessed by a separate adjudicator LLM.

Setting: Publicly available model interfaces, December 2025.

Participants: Five board-certified psychiatrists evaluated model-generated clinical reasoning. Two psychiatry residents served as the illustrative human comparison.

Main Outcomes and Measures: Diagnostic accuracy and clinician-rated clinical reasoning quality. Diagnostic accuracy was assessed using top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank based on ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was assessed using two 5-point Likert scales adapted from the Accreditation Council for Graduate Medical Education Psychiatry Residency Milestones, evaluating data extraction and diagnostic reasoning.

Results: Across 196 psychiatric case vignettes, Claude Opus 4.5 (Anthropic) achieved the highest diagnostic accuracy (top-1 accuracy, 0.638; top-5 accuracy, 0.801; recall@5, 0.731; mean reciprocal rank, 0.710) and the highest clinician-rated reasoning scores. Higher clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1.80; p < 0.001), corresponding to an approximately six-fold increase in odds of a correct diagnosis per 1-point increase in reasoning score. In an illustrative comparison, diagnostic accuracy of Claude Opus 4.5 fell within the range observed for psychiatry trainees.

Conclusions and Relevance: LLMs demonstrated high diagnostic accuracy and generated clinical reasoning that clinicians judged to be largely coherent and safe. Diagnostic reasoning quality was more strongly associated with diagnostic correctness than data extraction quality, underscoring the importance of evaluating reasoning alongside accuracy when assessing LLMs for clinical decision support in psychiatry.

Key Points

Question: Can multiple large language models accurately diagnose psychiatric conditions and generate diagnostic reasoning that clinicians judge as coherent, safe, and clinically meaningful?

Findings: Across 196 psychiatric case vignettes, four large language models demonstrated high diagnostic accuracy. In a clinician-evaluated subset of 30 vignettes, model diagnostic accuracy fell within the range observed for psychiatry residents. Clinicians judged model-generated diagnostic reasoning to be largely coherent and safe. Higher clinician-rated reasoning quality was strongly associated with diagnostic correctness, independent of data extraction quality.

Meaning: Evaluating diagnostic reasoning, in addition to accuracy, may be important when assessing large language models for potential clinical decision support in psychiatry.
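
The abstract names four ranking metrics and a log-odds coefficient without defining them. The following is a minimal sketch of how such quantities are commonly computed, not the authors' evaluation code. It assumes each vignette may have more than one reference diagnosis, which would explain recall@5 being lower than top-5 accuracy; that reading, the function name `ranking_metrics`, and the toy vignettes are all hypothetical illustrations.

```python
import math


def ranking_metrics(predictions, gold):
    """Score ranked differential lists against reference diagnoses.

    predictions: list of ranked diagnosis lists (only the first 5 are used).
    gold: list of non-empty sets of reference diagnoses, one per vignette.
    Definitions here are common conventions, assumed rather than taken
    from the paper.
    """
    n = len(predictions)
    top1 = top5 = recall5 = mrr = 0.0
    for ranked, truth in zip(predictions, gold):
        ranked = ranked[:5]
        # Zero-based positions of predictions that match a reference diagnosis.
        hits = [i for i, dx in enumerate(ranked) if dx in truth]
        top1 += 1.0 if hits and hits[0] == 0 else 0.0   # first guess correct
        top5 += 1.0 if hits else 0.0                    # any correct in top 5
        recall5 += len(set(ranked) & truth) / len(truth)  # share of references found
        mrr += 1.0 / (hits[0] + 1) if hits else 0.0     # reciprocal rank of first hit
    return {"top1": top1 / n, "top5": top5 / n,
            "recall@5": recall5 / n, "mrr": mrr / n}


# Toy data (invented for illustration only).
predictions = [
    ["MDD", "GAD", "PTSD", "bipolar II", "adjustment disorder"],
    ["schizophrenia", "schizoaffective", "bipolar I", "delusional disorder", "MDD"],
    ["OCD", "GAD", "panic disorder", "social anxiety", "specific phobia"],
    ["GAD", "MDD", "PTSD", "OCD", "panic disorder"],
]
gold = [{"MDD", "GAD"}, {"bipolar I"}, {"PTSD", "anorexia nervosa"}, {"ADHD", "GAD"}]

metrics = ranking_metrics(predictions, gold)

# A logistic-regression coefficient is a log-odds change per unit of the
# predictor, so exponentiating beta = 1.80 gives the odds ratio per
# 1-point increase in reasoning score.
odds_ratio = math.exp(1.80)
```

Exponentiating the reported coefficient gives an odds ratio of about 6.05, consistent with the "approximately six-fold" wording in the Results.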

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1. Psychiatry Research (35 papers in training set; Top 0.1%; 10.0%)
2. Acta Psychiatrica Scandinavica (10 papers in training set; Top 0.1%; 7.1%)
3. Schizophrenia Bulletin (29 papers in training set; Top 0.2%; 7.1%)
4. PLOS ONE (4510 papers in training set; Top 29%; 6.2%)
5. European Psychiatry (10 papers in training set; Top 0.1%; 4.8%)
6. BMJ Open (554 papers in training set; Top 5%; 4.1%)
7. BJPsych Open (25 papers in training set; Top 0.1%; 3.9%)
8. JAMA Network Open (127 papers in training set; Top 1%; 3.2%)
9. Frontiers in Psychiatry (83 papers in training set; Top 1%; 3.0%)
10. Psychological Medicine (74 papers in training set; Top 0.6%; 3.0%)

50% of probability mass above

11. The British Journal of Psychiatry (21 papers in training set; Top 0.3%; 2.7%)
12. Acta Neuropsychiatrica (12 papers in training set; Top 0.2%; 2.6%)
13. BMJ Mental Health (15 papers in training set; Top 0.1%; 1.9%)
14. Journal of Medical Internet Research (85 papers in training set; Top 2%; 1.8%)
15. Psychiatry and Clinical Neurosciences (11 papers in training set; Top 0.1%; 1.7%)
16. Translational Psychiatry (219 papers in training set; Top 3%; 1.7%)
17. Schizophrenia Research (29 papers in training set; Top 0.4%; 1.6%)
18. Journal of General Internal Medicine (20 papers in training set; Top 0.5%; 1.6%)
19. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics (22 papers in training set; Top 0.2%; 1.5%)
20. npj Digital Medicine (97 papers in training set; Top 2%; 1.5%)
21. Computational Psychiatry (12 papers in training set; Top 0.1%; 1.3%)
22. JAMA Psychiatry (13 papers in training set; Top 0.3%; 1.3%)
23. Journal of Affective Disorders (81 papers in training set; Top 1%; 1.3%)
24. Frontiers in Artificial Intelligence (18 papers in training set; Top 0.5%; 1.2%)
25. BMC Medicine (163 papers in training set; Top 5%; 1.2%)
26. American Journal of Psychiatry (20 papers in training set; Top 0.4%; 0.9%)
27. JMIR Formative Research (32 papers in training set; Top 1%; 0.9%)
28. Genetics in Medicine (69 papers in training set; Top 0.9%; 0.8%)
29. Journal of Affective Disorders Reports (10 papers in training set; Top 0.3%; 0.8%)
30. BMC Psychiatry (22 papers in training set; Top 0.7%; 0.8%)