Multi-Criteria Validation of LLM-Inferred Depression Severity from Outpatient Psychiatry Notes
Cudic, M.; Meyerson, W. U.; Wang, B.; Yin, Q.; Khadse, P. N.; Burke, T.; Kennedy, C. J.; Smoller, J. W.
Background: Longitudinal measurement of depression severity in outpatient psychiatric care is limited by infrequent standardized assessments. Although psychiatric clinical notes capture illness burden and functional impairment, this information is rarely quantified for analysis.
Objective: To evaluate whether large language models (LLMs) can infer clinically meaningful measures of depression severity from outpatient psychiatry notes.
Methods: We sampled 91,651 outpatient psychiatry notes from 8,287 adult patients across 58 clinics within a large academic medical center between 2015 and 2021. A HIPAA-compliant LLM (OpenAI GPT-5.2) was prompted to independently estimate three depression severity scores from the notes: the Patient Health Questionnaire-9 (PHQ-9), the Hamilton Depression Rating Scale (HAM-D), and the depression-specific Clinical Global Impression-Severity (CGI-S). Patient-reported PHQ-9 content within the notes was redacted to avoid biasing the estimates. Convergent validity was assessed against patient-reported PHQ-9 (n=3,757), study-clinician chart review (n=125), and treating-clinician suicide risk assessments (SRA; n=2,985). Predictive validity was evaluated using survival models of antidepressant switching and psychiatric emergency visits. Discriminant validity across diagnoses and consistency across demographic groups and clinics were also evaluated.
Results: A PHQ-9 was recorded within 7 days before the encounter for 10.8% of eligible visits. LLM-inferred PHQ-9 scores showed moderate agreement with patient-reported PHQ-9 (Cohen's κ=0.64, 95% CI: 0.62-0.66; Pearson r=0.67, 95% CI: 0.65-0.68). Agreement was stronger between LLM CGI-S and study-clinician chart review (rater 1: κ=0.79, 95% CI: 0.70-0.85; rater 2: κ=0.67, 95% CI: 0.58-0.77; r=0.86 with the mean rating, 95% CI: 0.80-0.90). In prospective analyses, LLM CGI-S predicted antidepressant switching (C-index=0.60, 95% CI: 0.58-0.62) and psychiatric emergency visits (C-index=0.63, 95% CI: 0.57-0.68), comparable to the predictive performance of patient-reported PHQ-9 and treating-clinician SRA. Correlations between LLM CGI-S and patient-reported PHQ-9 were consistent across clinics (I²<0.1) but significantly lower among Black (r=0.48, 95% CI: 0.38-0.57) and Hispanic (r=0.43, 95% CI: 0.27-0.56) patients.
Conclusions: LLM-inferred depression severity scores from psychiatric outpatient notes support longitudinal, standardized phenotyping of depression severity, such as for routine outcome monitoring. These results have implications for facilitating genetic, pharmacoepidemiologic, and antidepressant treatment effectiveness studies using real-world evidence.
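To make the reported C-index values easier to interpret, the following minimal sketch computes Harrell's concordance index for right-censored data. It is illustrative only: the abstract does not state which C-index estimator or software the authors used, and the function name `harrell_c_index`, the tie handling (ties in score count as half-concordant), and the toy data are all assumptions introduced here.

```python
from typing import Sequence


def harrell_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's C-index (hypothetical helper, not the authors' code).

    times       : follow-up time to event or censoring
    events      : 1 if the event (e.g. antidepressant switch) was observed, 0 if censored
    risk_scores : higher score = higher predicted risk (e.g. an inferred CGI-S)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            # Pair is comparable when i had the event strictly before j's time.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half (assumption)
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): four patients, one censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [7, 5, 5, 1]
print(harrell_c_index(toy_times, toy_events, toy_scores))  # 0.9
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so the reported values of 0.60 and 0.63 indicate modest discrimination.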
Matching journals
The top 11 journals account for 50% of the predicted probability mass.