Development, System Design, Safety, and Performance Metrics of a Conversational Agent for Reducing Depressive and Anxious Symptoms Based on a Large Language Model: The MHAI Study

Villarreal-Zegarra, D.; Paredes-Gonzales, Y.; Damaso-Roman, A.; Quinones-Inga, J.; Centeno-Terrazas, G.; Lozada, Y. P. A.-M.

2025-09-24 psychiatry and clinical psychology
10.1101/2025.09.22.25336411 medRxiv
Background: Conversational agents based on large language models (LLMs) have shown moderate efficacy in reducing depressive and anxiety symptoms. However, most existing evaluations lack methodological transparency, rely on closed-source models, and show limited standardization in performance and safety assessment.

Objective: This study had two objectives: (1) to develop an LLM-based conversational agent through system design analysis and initial functionality testing, and (2) to evaluate the safety and performance of two LLMs (GPT-4o and Llama 3.1-8B) through standardized assessment in controlled simulated interactions focused on depression and anxiety.

Methods: We conducted a cross-sectional study in two phases. First, we developed a mental health platform integrating a conversational agent with functionalities including personalized context, pretrained therapeutic modules, self-assessment tools, and an emergency alert system. Second, we evaluated the agent's responses in simulated interactions based on predefined user personas for each LLM. Four expert raters assessed 816 interaction pairs using a 5-criterion Likert scale evaluating tone, clarity, domain accuracy (correctness), robustness, completeness, boundaries, target language, and safety. In addition, we used performance metrics based on numerical criteria such as cost, response length, and number of tokens. Multiple linear regression models were used to compare LLM performance and assess metric interrelations.

Results: First, we developed a web-based mental health platform using a user-centered design, structured into frontend, backend, and database layers. The system integrates therapeutic chat (GPT-4o and Llama 3.1-8B), psychological assessments (PHQ-9, GAD-7), CBT-based tasks, and an emergency alert system. The platform supports secure user authentication, data encryption, multilingual access, and session tracking. Second, GPT-4o outperformed Llama 3.1-8B on both the numerical performance metrics and the Likert scale criteria: it generated longer and more lexically diverse responses, used more tokens, and scored higher in clarity, robustness, completeness, boundaries, and target language. However, it incurred higher costs, and there were no significant differences in tone, accuracy, or safety.

Conclusion: Our study presents a conversational agent with multiple functionalities and shows that GPT-4o outperforms Llama 3.1-8B in performance, although at a higher cost. This platform could be used in future clinical trials or real-world implementation studies.

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Journal of Medical Internet Research: 85 papers in training set, Top 0.1%, 41.8%
2. JMIR Formative Research: 32 papers in training set, Top 0.1%, 15.1%
50% of probability mass above
3. Frontiers in Psychiatry: 83 papers in training set, Top 0.2%, 11.0%
4. Frontiers in Digital Health: 20 papers in training set, Top 0.1%, 7.2%
5. PLOS ONE: 4510 papers in training set, Top 33%, 4.5%
6. Scientific Reports: 3102 papers in training set, Top 55%, 1.8%
7. Healthcare: 16 papers in training set, Top 0.7%, 1.6%
8. DIGITAL HEALTH: 12 papers in training set, Top 0.4%, 1.3%
9. npj Digital Medicine: 97 papers in training set, Top 3%, 0.9%
10. JMIR mHealth and uHealth: 10 papers in training set, Top 0.3%, 0.9%
11. Frontiers in Public Health: 140 papers in training set, Top 7%, 0.8%
12. Acta Psychiatrica Scandinavica: 10 papers in training set, Top 0.3%, 0.8%
13. PLOS Digital Health: 91 papers in training set, Top 2%, 0.8%
14. JMIR Research Protocols: 18 papers in training set, Top 1%, 0.8%
15. JMIRx Med: 31 papers in training set, Top 2%, 0.8%
16. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 3%, 0.7%
17. BJPsych Open: 25 papers in training set, Top 0.8%, 0.7%
18. Nature Medicine: 117 papers in training set, Top 6%, 0.5%
19. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 3%, 0.5%
20. Journal of Affective Disorders: 81 papers in training set, Top 2%, 0.5%
21. Public Health in Practice: 11 papers in training set, Top 0.5%, 0.5%
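
The statement that the top 2 journals account for 50% of the predicted probability mass can be checked with a running total over the ranked list above. The following is a minimal sketch, not the site's own method: the function name `journals_needed` is hypothetical, and it assumes the statement means "the smallest number of top-ranked journals whose summed probability reaches at least 50%". The percentages are copied from the list.

```python
# Predicted probabilities (in percent) copied from the ranked list, in rank order.
probabilities = [
    41.8, 15.1, 11.0, 7.2, 4.5, 1.8, 1.6, 1.3, 0.9, 0.9, 0.8,
    0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.5, 0.5, 0.5, 0.5,
]


def journals_needed(probs, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count
    return None


top_n = journals_needed(probabilities, 50.0)
covered = sum(probabilities[:top_n])
print(f"Top {top_n} journals cover {covered:.1f}% of the probability mass")
```

Under this reading, the first journal alone (41.8%) falls short of 50%, and adding the second (15.1%) brings the running total past the threshold, which matches the position of the "50% of probability mass above" marker in the list.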