Multi-Criteria Validation of LLM-Inferred Depression Severity from Outpatient Psychiatry Notes

Cudic, M.; Meyerson, W. U.; Wang, B.; Yin, Q.; Khadse, P. N.; Burke, T.; Kennedy, C. J.; Smoller, J. W.

2026-03-12 psychiatry and clinical psychology
10.64898/2026.03.11.26348066 medRxiv

Background: Longitudinal measurement of depression severity in outpatient psychiatric care is limited by infrequent standardized assessments. Although psychiatric clinical notes capture illness burden and functional impairment, this information is rarely quantified for analysis.

Objective: To evaluate whether large language models (LLMs) can infer clinically meaningful measures of depression severity from outpatient psychiatry notes.

Methods: We sampled 91,651 outpatient psychiatry notes from 8,287 adult patients across 58 clinics within a large academic medical center between 2015 and 2021. A HIPAA-compliant LLM (OpenAI GPT-5.2) was prompted to independently estimate three depression severity scores (Patient Health Questionnaire-9 [PHQ-9], Hamilton Depression Rating Scale [HAM-D], and depression-specific Clinical Global Impression-Severity [CGI-S]) from notes, with patient-reported PHQ-9 content within notes redacted to prevent biasing. Convergent validity was assessed against patient-reported PHQ-9 (n=3,757), study-clinician chart review (n=125), and treating-clinician suicide risk assessments (SRA; n=2,985). Predictive validity was evaluated using survival models of antidepressant switching and psychiatric emergency visits. Discriminant validity across diagnoses and consistency across demographic groups and clinics were also evaluated.

Results: 10.8% of eligible visits had a PHQ-9 recorded within 7 days before the encounter. LLM-inferred PHQ-9 scores showed moderate agreement with patient-reported PHQ-9 (Cohen's κ=0.64, 95% CI: 0.62-0.66; Pearson r=0.67, 95% CI: 0.65-0.68). Stronger agreement was found between LLM CGI-S and study-clinician chart review (rater 1 κ=0.79, 95% CI: 0.70-0.85; rater 2 κ=0.67, 95% CI: 0.58-0.77; r=0.86 with mean rating, 95% CI: 0.80-0.90). In prospective analyses, LLM CGI-S predicted antidepressant switching (C-index=0.60; 95% CI: 0.58-0.62) and psychiatric emergency visits (C-index=0.63; 95% CI: 0.57-0.68), which was comparable to the predictive performance of patient-reported PHQ-9 and treating-clinician SRA. Correlations between LLM CGI-S and patient-reported PHQ-9 were consistent across clinics (I²<0.1) but significantly lower among Black (r=0.48, 95% CI: 0.38-0.57) and Hispanic (r=0.43, 95% CI: 0.27-0.56) patients.

Conclusions: LLM-inferred depression severity scores from psychiatric outpatient notes support longitudinal, standardized phenotyping of depression severity, such as for routine outcome monitoring. These results have implications for facilitating genetic, pharmacoepidemiologic, and antidepressant treatment effectiveness studies using real-world evidence.
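The convergent-validity figures in the abstract rest on two agreement statistics, Cohen's κ and Pearson r. The following is a minimal sketch of how each could be computed for paired scores, not the authors' analysis code. The abstract does not say whether κ was weighted or how scores were categorized, so the sketch assumes unweighted κ over the standard PHQ-9 severity bands (0-4, 5-9, 10-14, 15-19, 20-27). The function names and the six score pairs are hypothetical, chosen only to make the arithmetic visible.

```python
from collections import Counter
from math import sqrt


def phq9_band(score):
    """Map a PHQ-9 total (0-27) to a severity band index 0..4.

    Uses the standard cut points 5, 10, 15, 20 (an assumption here;
    the abstract does not state how scores were categorized).
    """
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 total must be between 0 and 27")
    return sum(score >= cut for cut in (5, 10, 15, 20))


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of marginal proportions, summed over labels.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical paired scores, for illustration only (not study data).
reported = [3, 8, 12, 17, 22, 6]
inferred = [4, 11, 13, 16, 19, 7]

kappa = cohens_kappa([phq9_band(s) for s in reported],
                     [phq9_band(s) for s in inferred])
r = pearson_r(reported, inferred)
print(f"kappa={kappa:.2f}, r={r:.2f}")
```

For ordinal scales such as these, a weighted κ that gives partial credit to near-misses is a common alternative; the unweighted form is shown only because it is the simplest to check by hand.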

Matching journals

The top 11 journals account for 50% of the predicted probability mass.

1. Psychological Medicine: 74 papers in training set; Top 0.2%; 6.6%
2. European Psychiatry: 10 papers in training set; Top 0.1%; 6.2%
3. Journal of Affective Disorders: 81 papers in training set; Top 0.4%; 6.1%
4. BJPsych Open: 25 papers in training set; Top 0.1%; 4.7%
5. Translational Psychiatry: 219 papers in training set; Top 1%; 4.7%
6. JAMA Network Open: 127 papers in training set; Top 0.6%; 4.7%
7. JAMA Psychiatry: 13 papers in training set; Top 0.1%; 4.7%
8. The British Journal of Psychiatry: 21 papers in training set; Top 0.2%; 4.2%
9. BMJ Mental Health: 15 papers in training set; Top 0.1%; 3.8%
10. Acta Psychiatrica Scandinavica: 10 papers in training set; Top 0.1%; 3.8%
11. Psychiatry Research: 35 papers in training set; Top 0.4%; 3.7%

50% of probability mass above

12. npj Digital Medicine: 97 papers in training set; Top 1%; 3.5%
13. Biological Psychiatry: 119 papers in training set; Top 1.0%; 3.5%
14. Molecular Psychiatry: 242 papers in training set; Top 1%; 3.5%
15. Schizophrenia Bulletin: 29 papers in training set; Top 0.3%; 3.5%
16. American Journal of Psychiatry: 20 papers in training set; Top 0.1%; 2.0%
17. BMC Medicine: 163 papers in training set; Top 4%; 1.6%
18. Acta Neuropsychiatrica: 12 papers in training set; Top 0.5%; 1.6%
19. PLOS ONE: 4510 papers in training set; Top 56%; 1.6%
20. PLOS Medicine: 98 papers in training set; Top 3%; 1.6%
21. Biological Psychiatry Global Open Science: 54 papers in training set; Top 0.7%; 1.6%
22. Frontiers in Psychiatry: 83 papers in training set; Top 2%; 1.3%
23. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging: 62 papers in training set; Top 1%; 1.3%
24. BMJ Open: 554 papers in training set; Top 11%; 1.2%
25. Journal of Psychiatric Research: 28 papers in training set; Top 0.6%; 1.2%
26. Epidemiology and Psychiatric Sciences: 10 papers in training set; Top 0.3%; 0.9%
27. Nature Communications: 4913 papers in training set; Top 60%; 0.9%
28. Journal of Medical Internet Research: 85 papers in training set; Top 4%; 0.9%
29. European Neuropsychopharmacology: 15 papers in training set; Top 0.6%; 0.8%
30. International Journal of Neuropsychopharmacology: 11 papers in training set; Top 0.2%; 0.8%
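
The statement that the top 11 journals account for 50% of the predicted probability mass can be checked against the listed percentages with a running sum. The sketch below is an illustration, not the site's own method: it copies the 30 displayed values (which are rounded, so the totals are approximate) and finds the smallest rank at which the cumulative sum reaches a threshold. The function name `journals_to_reach` is hypothetical.

```python
# Displayed predicted probabilities (%) for ranks 1-30, copied from the
# list above. They are rounded to one decimal, so sums are approximate.
probs = [
    6.6, 6.2, 6.1, 4.7, 4.7, 4.7, 4.7, 4.2, 3.8, 3.8, 3.7,  # ranks 1-11
    3.5, 3.5, 3.5, 3.5, 2.0, 1.6, 1.6, 1.6, 1.6, 1.6,       # ranks 12-21
    1.3, 1.3, 1.2, 1.2, 0.9, 0.9, 0.9, 0.8, 0.8,            # ranks 22-30
]


def journals_to_reach(probabilities, threshold):
    """Return (rank, cumulative %) at the first rank whose running sum
    reaches the threshold, or None if the threshold is never reached."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


rank, cumulative = journals_to_reach(probs, 50.0)
print(f"{rank} journals reach {cumulative:.1f}% of the probability mass")
```

With the displayed values, the first ten journals sum to just under 50% and the eleventh carries the total past it, which is consistent with the cutoff marker in the list.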