
A Chatbot for the Management of Bipolar Disorder: Using Retrieval-Augmented Generation with an Open-Weight Large Language Model to Answer Clinical Questions Based on the CANMAT and ISBD 2018 Guidelines

Mali, Y.; Zeng, Z.; Heo, K.; Zhang, G.; Chen, J.; Keramatian, K.; Saraf, G.; Solmi, M.; Tam, E.; Parikh, S.; Schaffer, A.; Beaulieu, S.; Ng, R.; Yatham, L. N.; Nunez, J.-J.

2025-12-02 psychiatry and clinical psychology
10.64898/2025.11.30.25341311 medRxiv

Objective: Clinical practice guidelines support evidence-based care but are often underused due to complexity, time constraints, and navigation challenges. We investigated whether a conversational agent (chatbot) using an open-weight large language model (LLM) with retrieval-augmented generation (RAG) could provide guideline-consistent answers for bipolar disorder management based on the full 2018 CANMAT and ISBD guidelines, compared with a system using only the base LLM.

Method: We developed a multi-step RAG-based chatbot that retrieves relevant guideline sections and generates responses using Llama 3.3 70B. Twenty-one clinical vignettes spanning all guideline sections were created. Six expert psychiatrists generated queries and were presented with paired, unlabeled responses from two systems: one using the base Llama 3.3 70B model, the other RAG-enhanced. Experts rated the guideline consistency of each response on a three-point scale, and ratings were analyzed using mixed-effects ordinal logistic regression.

Results: Experts evaluated 126 paired responses, of which 110 RAG responses (87.3%) were rated as correct as or more correct than the baseline response. The RAG system produced 80 answers (63.5%) rated fully consistent with the guidelines versus 24 (19.0%) for baseline, and only 10 answers (7.9%) with major deviation versus 48 (38.1%) for baseline. Ordinal regression showed that RAG responses were significantly more likely to be rated more correct (OR = 9.1, 95% CI 5.3-16.3, p < 0.001), a finding consistent across all raters. Preference ratings favored RAG answers in 78.7% of cases. Performance varied by vignette, with some errors in both retrieval and reasoning.

Conclusion: The use of RAG with an open-weight model helped produce answers consistent with the CANMAT guidelines across vignettes that required adapting or combining guideline text, suggesting the viability of a bipolar guideline chatbot. We identified areas to improve results and evaluation. Future work should explore additional retrieval strategies and LLMs, and test in more naturalistic settings.
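The counts and percentages reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that every one of the six raters rated each of the 21 vignettes once (which gives the stated total of 126), and the variable and label names are made up for this example.

```python
# Figures taken from the abstract; names and labels are illustrative.
N_VIGNETTES = 21
N_RATERS = 6

# Assumption: each rater rated every vignette once, giving 126 rated pairs.
total = N_VIGNETTES * N_RATERS

# label -> (reported count, reported percentage)
reported = {
    "RAG at least as correct as baseline": (110, 87.3),
    "RAG fully consistent": (80, 63.5),
    "Baseline fully consistent": (24, 19.0),
    "RAG major deviation": (10, 7.9),
    "Baseline major deviation": (48, 38.1),
}


def pct(count: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as in the abstract."""
    return round(100 * count / denominator, 1)


# True where the recomputed percentage matches the reported one.
checks = {
    label: pct(count, total) == stated
    for label, (count, stated) in reported.items()
}

for label, (count, stated) in reported.items():
    print(f"{label}: {count}/{total} = {pct(count, total)}% (reported {stated}%)")
```

Under that assumption, each recomputed percentage agrees with the value stated in the abstract.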

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Frontiers in Psychiatry | 83 | Top 0.4% | 8.3%
2 | Psychiatry Research | 35 | Top 0.1% | 8.3%
3 | Acta Psychiatrica Scandinavica | 10 | Top 0.1% | 8.3%
4 | Journal of Medical Internet Research | 85 | Top 0.5% | 8.1%
5 | Journal of Affective Disorders | 81 | Top 0.4% | 6.2%
6 | JMIR Formative Research | 32 | Top 0.2% | 4.8%
7 | PLOS ONE | 4510 | Top 34% | 4.3%
8 | European Psychiatry | 10 | Top 0.1% | 3.9%
(50% of probability mass above)
9 | Frontiers in Digital Health | 20 | Top 0.3% | 3.5%
10 | BJPsych Open | 25 | Top 0.2% | 3.5%
11 | BMJ Mental Health | 15 | Top 0.1% | 3.2%
12 | Acta Neuropsychiatrica | 12 | Top 0.3% | 2.1%
13 | BMJ Open | 554 | Top 8% | 2.1%
14 | Psychological Medicine | 74 | Top 0.9% | 1.9%
15 | npj Digital Medicine | 97 | Top 2% | 1.6%
16 | Journal of General Internal Medicine | 20 | Top 0.6% | 1.5%
17 | BMC Health Services Research | 42 | Top 1% | 1.5%
18 | Journal of Affective Disorders Reports | 10 | Top 0.1% | 1.2%
19 | Schizophrenia Bulletin | 29 | Top 0.5% | 1.2%
20 | JMIRx Med | 31 | Top 1% | 1.2%
21 | American Journal of Medical Genetics Part B: Neuropsychiatric Genetics | 22 | Top 0.3% | 1.1%
22 | JMIR Research Protocols | 18 | Top 1% | 0.9%
23 | Frontiers in Public Health | 140 | Top 7% | 0.9%
24 | Journal of Psychiatric Research | 28 | Top 0.6% | 0.9%
25 | BMC Psychiatry | 22 | Top 0.7% | 0.8%
26 | The British Journal of Psychiatry | 21 | Top 0.9% | 0.8%
27 | Nature Medicine | 117 | Top 4% | 0.8%
28 | Epidemiology and Psychiatric Sciences | 10 | Top 0.4% | 0.7%
29 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.7%
30 | Computational Psychiatry | 12 | Top 0.2% | 0.7%
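
The "50% of probability mass" cutoff can be read as the smallest number of top-ranked journals whose predicted probabilities sum to at least 50%. A minimal sketch of that reading, using the percentages copied from the list above (the function name is hypothetical, and the page does not state how the cutoff is actually computed):

```python
# Predicted probabilities in percent for ranks 1 to 30, copied from the list.
probs = [
    8.3, 8.3, 8.3, 8.1, 6.2, 4.8, 4.3, 3.9, 3.5, 3.5,
    3.2, 2.1, 2.1, 1.9, 1.6, 1.5, 1.5, 1.2, 1.2, 1.2,
    1.1, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7,
]


def smallest_prefix_reaching(values, threshold):
    """Return (k, cumulative sum) for the first k values reaching threshold.

    Returns (None, total) if the threshold is never reached.
    """
    running = 0.0
    for k, value in enumerate(values, start=1):
        running += value
        if running >= threshold:
            return k, running
    return None, running


k, mass = smallest_prefix_reaching(probs, 50.0)
print(f"Top {k} journals cover {mass:.1f}% of the predicted probability mass")
```

With these values the top seven journals sum to 48.3% and the top eight to 52.2%, which matches the cutoff marked after rank 8.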