Structured large language model extraction of clinical factors from electronic health record text supports scalable psychiatric severity prediction
Stephenson, C.; Camassa, A.; Wagner, M.; Shirazi, A. H.; Alavi, N.; Omrani, M.
Background: Mental health systems face escalating demand that exceeds clinician capacity, making accurate severity-based triage a critical bottleneck. Severity assessment guides treatment intensity, resource allocation, and risk management, yet most clinically relevant information remains embedded in unstructured electronic health record (EHR) narratives, limiting its utility for scalable decision support.
Objectives: This study evaluates whether a single large language model (LLM) can autonomously extract clinical factors from psychiatric EHR narratives, derive predictive weights from those factors, and use the resulting structured representation to predict clinician-implied severity at scale.
Methods: From a Mayo Clinic repository of more than 2.7 million encounters, 15,000 de-identified psychiatric notes were sampled into a 5,000-patient discovery cohort and a 10,000-patient replication cohort. The same LLM (Llama 3 8B Instruct) extracted 17 background clinical factors and 3 treatment-action factors from each note. Severity reference labels were derived from the treatment-action factors using pre-specified clinical criteria. The LLM independently derived two factor-weight dictionaries from the discovery cohort: one capturing risk-oriented predictors of severe presentations and one capturing protective predictors. Five weighting conditions were then evaluated against the severity labels: the two LLM-derived dictionaries, two controls (LLM-derived variables with randomized weights; clinically irrelevant variables with arbitrary weights), and an unweighted zero-shot baseline. Performance was assessed across 928 valid iterations in the replication cohort.
Results: LLM-derived structured conditions significantly outperformed all controls and the baseline, with statistically equivalent performance between the two structured conditions. Improvements in precision and recall were balanced, indicating gains in discriminative capacity rather than threshold shifts. The variables and weights the LLM derived as predictors of severe presentations aligned closely with established clinical determinants of psychiatric severity.
Conclusion: A single LLM can derive clinically meaningful factor weights from unstructured EHR narratives and use them to predict psychiatric severity at scale, supporting a viable path toward interpretable, scalable triage in resource-constrained mental health systems.
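To make the weighting step in the Methods more concrete, the following is a minimal sketch of how a factor-weight dictionary might be applied to extracted factors and scored against severity labels. It is not the authors' implementation: the factor names, weight values, decision threshold, and toy notes are all hypothetical, and the abstract does not state how weights are combined or how a cutoff is chosen. A simple weighted sum with a fixed threshold is assumed here purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical factor-weight dictionary. The abstract does not list the
# actual factors or weights; these names and values are illustrative only.
RISK_WEIGHTS: Dict[str, float] = {
    "suicidal_ideation": 3.0,
    "psychosis": 2.5,
    "substance_use": 1.5,
    "social_support": -1.0,  # a protective factor lowers the score
}


def severity_score(factors: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Sum the weights of all factors marked present in one note."""
    return sum(weights.get(name, 0.0) for name, present in factors.items() if present)


def predict_severe(factors: Dict[str, bool], weights: Dict[str, float],
                   threshold: float = 2.0) -> bool:
    """Flag a note as severe when its score meets the assumed threshold."""
    return severity_score(factors, weights) >= threshold


def precision_recall(preds: List[bool], labels: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (severe) class."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy extracted factors and reference labels (fabricated for the example,
# not drawn from the study data).
notes = [
    {"suicidal_ideation": True, "substance_use": True},
    {"substance_use": True, "social_support": True},
    {"psychosis": True, "social_support": True},
    {"substance_use": True},
]
labels = [True, False, True, False]

preds = [predict_severe(n, RISK_WEIGHTS) for n in notes]
precision, recall = precision_recall(preds, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision and recall together, as the Results do, helps separate a real gain in discrimination from a shift in the cutoff: lowering the threshold alone would typically raise recall at the expense of precision.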