Diagnostic Accuracy and Clinical Reasoning of Multiple Large Language Models in Psychiatry
Jin, K. W.; Rostam-Abadi, Y.; Chaudhary, P.; Garrett, M. A.; Huang, A. S.; Montelongo, M.; Nagpal, C.; Shei, J.; Weathers, J.; Zhang, J. S.; Chen, Q.; Kim, J.; Malgaroli, M.; Mathis, W. S.; Rodriguez, C. I.; Selek, S.; Sharma, M. S.; Pittenger, C.; Yip, S. W.; Zaboski, B. A.; Xu, H.
Importance: Large language models (LLMs) have demonstrated diagnostic potential in several medical specialties, but their application to psychiatry, where diagnosis relies heavily on clinical judgment, narrative interpretation, and reasoning under uncertainty, remains insufficiently evaluated.

Objective: To evaluate the diagnostic accuracy and clinician-judged reasoning quality of multiple large language models using psychiatric case vignettes.

Design: Mixed-methods evaluation study of diagnostic accuracy across four LLMs using 196 psychiatric case vignettes (135 published and 61 novel). Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes using structured clinician ratings along two reasoning dimensions. The highest-performing model was illustratively compared with psychiatry trainees on the same subset. Diagnostic correctness for the full vignette set was assessed by a separate adjudicator LLM.

Setting: Publicly available model interfaces, December 2025.

Participants: Five board-certified psychiatrists evaluated model-generated clinical reasoning. Two psychiatry residents served as the illustrative human comparison.

Main Outcomes and Measures: Diagnostic accuracy and clinician-rated clinical reasoning quality. Diagnostic accuracy was assessed using top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank, based on ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was assessed using two 5-point Likert scales adapted from the Accreditation Council for Graduate Medical Education (ACGME) Psychiatry Residency Milestones, evaluating data extraction and diagnostic reasoning.

Results: Across 196 psychiatric case vignettes, Claude Opus 4.5 (Anthropic) achieved the highest diagnostic accuracy (top-1 accuracy, 0.638; top-5 accuracy, 0.801; recall@5, 0.731; mean reciprocal rank, 0.710) and the highest clinician-rated reasoning scores. Higher clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1.80; p < 0.001), corresponding to an approximately six-fold increase in the odds of a correct diagnosis per 1-point increase in reasoning score. In an illustrative comparison, the diagnostic accuracy of Claude Opus 4.5 fell within the range observed for psychiatry trainees.

Conclusions and Relevance: LLMs demonstrated high diagnostic accuracy and generated clinical reasoning that clinicians judged to be largely coherent and safe. Diagnostic reasoning quality was more strongly associated with diagnostic correctness than data extraction quality, underscoring the importance of evaluating reasoning alongside accuracy when assessing LLMs for clinical decision support in psychiatry.

Key Points

Question: Can multiple large language models accurately diagnose psychiatric conditions and generate diagnostic reasoning that clinicians judge as coherent, safe, and clinically meaningful?

Findings: Across 196 psychiatric case vignettes, four large language models demonstrated high diagnostic accuracy. In a clinician-evaluated subset of 30 vignettes, model diagnostic accuracy fell within the range observed for psychiatry residents. Clinicians judged model-generated diagnostic reasoning to be largely coherent and safe. Higher clinician-rated reasoning quality was strongly associated with diagnostic correctness, independent of data extraction quality.

Meaning: Evaluating diagnostic reasoning, in addition to accuracy, may be important when assessing large language models for potential clinical decision support in psychiatry.
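
For readers unfamiliar with the ranking metrics named under Main Outcomes and Measures, the following is a minimal Python sketch, not the authors' code, of how top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank can be computed from ranked differential lists. It assumes each vignette carries a set of accepted gold diagnoses; when that set is a singleton, top-5 accuracy and recall@5 coincide, so the paper reporting them separately is consistent with some vignettes accepting more than one diagnosis (an assumption here, not a claim from the paper). All diagnoses in the example data are hypothetical placeholders.

```python
def rank_metrics(cases):
    """Compute ranking metrics over (gold, ranked) pairs.

    gold   -- set of accepted diagnoses for the vignette (often a singleton)
    ranked -- the model's ordered differential list (up to 5 entries)
    """
    n = len(cases)
    top1 = top5 = 0
    recall_sum = rr_sum = 0.0
    for gold, ranked in cases:
        shortlist = ranked[:5]
        hits = [d for d in shortlist if d in gold]
        if shortlist and shortlist[0] in gold:
            top1 += 1
        if hits:
            top5 += 1
            # reciprocal rank of the highest-ranked correct diagnosis;
            # vignettes with no correct diagnosis in the top 5 contribute 0
            rr_sum += 1.0 / (shortlist.index(hits[0]) + 1)
        # fraction of accepted diagnoses retrieved within the top 5
        recall_sum += len(hits) / len(gold)
    return {
        "top1_accuracy": top1 / n,
        "top5_accuracy": top5 / n,
        "recall_at_5": recall_sum / n,
        "mrr": rr_sum / n,
    }


cases = [
    # (accepted diagnoses, model's ranked differential) -- hypothetical data
    ({"major depressive disorder"},
     ["major depressive disorder", "generalized anxiety disorder",
      "persistent depressive disorder", "adjustment disorder",
      "post-traumatic stress disorder"]),
    ({"bipolar I disorder"},
     ["schizoaffective disorder", "bipolar I disorder",
      "brief psychotic disorder", "substance-induced mood disorder",
      "major depressive disorder"]),
]
print(rank_metrics(cases))
# {'top1_accuracy': 0.5, 'top5_accuracy': 1.0, 'recall_at_5': 1.0, 'mrr': 0.75}
```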
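
The "approximately six-fold" figure in the Results follows directly from the reported coefficient: in a logistic regression, each 1-point increase in a predictor multiplies the odds of the outcome by e^β, so here

    OR = e^β = e^1.80 ≈ 6.05,

roughly a six-fold increase in the odds of a correct diagnosis per 1-point increase in clinician-rated reasoning score.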