A Chatbot for the Management of Bipolar Disorder: Using Retrieval-Augmented Generation with an Open-Weight Large Language Model to Answer Clinical Questions Based on the CANMAT and ISBD 2018 Guidelines
Mali, Y.; Zeng, Z.; Heo, K.; Zhang, G.; Chen, J.; Keramatian, K.; Saraf, G.; Solmi, M.; Tam, E.; Parikh, S.; Schaffer, A.; Beaulieu, S.; Ng, R.; Yatham, L. N.; Nunez, J.-J.
Objective: Clinical practice guidelines support evidence-based care but are often underused due to complexity, time constraints, and navigation challenges. We investigated whether a conversational agent (chatbot) using an open-weight large language model (LLM) with retrieval-augmented generation (RAG) could provide guideline-consistent answers for bipolar disorder management based on the full 2018 CANMAT and ISBD guidelines, compared with a system using only the base LLM.
Method: We developed a multi-step RAG-based chatbot that retrieves relevant guideline sections and generates responses using Llama 3.3 70B. Twenty-one clinical vignettes spanning all guideline sections were created. Six expert psychiatrists generated queries and were presented with paired, unlabeled responses from two systems: one using the base Llama 3.3 70B model, the other RAG-enhanced. Experts rated each response for guideline consistency on a three-point scale, and ratings were analyzed using mixed-effects ordinal logistic regression.
Results: Experts evaluated 126 responses, of which 110 (87.3%) were rated as correct as or more correct than the baseline system's. The RAG system produced 80 answers (63.5%) rated fully consistent with the guidelines versus 24 (19.0%) for baseline, and only 10 answers (7.9%) with major deviation versus 48 (38.1%) for baseline. Ordinal regression showed that RAG responses were significantly more likely to be rated more correct (OR = 9.1, 95% CI 5.3-16.3, p < 0.001), a finding consistent across all raters. Preference ratings favored RAG answers in 78.7% of cases. Performance varied by vignette, with some errors in both retrieval and reasoning.
Conclusion: The use of RAG with an open-weight model helped produce answers consistent with the CANMAT guidelines across vignettes that required adapting or combining guideline text, suggesting the viability of a bipolar guideline chatbot. We identified areas to improve results and evaluation. Future work should explore additional retrieval strategies and LLMs, and test in more naturalistic settings.
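The reported percentages can be checked against the counts given in the abstract. The short sketch below does that arithmetic; it assumes the 126 rated responses correspond to 21 vignettes each rated by all six experts (an inference from the stated numbers, not something the abstract says outright), and the variable names are illustrative only.

```python
# Counts are taken from the abstract; names are illustrative, not the authors'.
VIGNETTES = 21
RATERS = 6

# Assumption: every rater scored every vignette, giving 21 * 6 rated responses.
total = VIGNETTES * RATERS

reported_counts = {
    "rag_at_least_as_correct": 110,
    "rag_fully_consistent": 80,
    "baseline_fully_consistent": 24,
    "rag_major_deviation": 10,
    "baseline_major_deviation": 48,
}


def pct(count: int, n: int) -> float:
    """Return count / n as a percentage rounded to one decimal place."""
    return round(100 * count / n, 1)


percentages = {name: pct(count, total) for name, count in reported_counts.items()}

for name, value in percentages.items():
    print(f"{name}: {value}%")
```

Under that assumption, each computed value matches the percentage stated in the abstract to one decimal place.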
Matching journals
The top 8 journals account for 50% of the predicted probability mass.