Toward Trustworthy Chatbots: A Protocol for Red Teaming Health-Related Conversations
Hussain, S.-A.; Jackson, D. I.; Lewis, A.; Fosler-Lussier, E.; Sezgin, E.
Introduction
Health-related chatbots increasingly mediate conversations that carry clinical significance and emotional weight. Retrieval-augmented generation (RAG) can reduce factual errors ("hallucinations"), but risks remain, and additional challenges arise when chatbots act against behavioral safety and scope rules. Red teaming, an adversarial testing process that deliberately probes systems for failures before deployment, offers a way to surface these risks. We describe a task-informed red-teaming protocol for health-related, patient-facing chatbots.

Methods
Our protocol comprises an error stratification, single- and multi-turn attack evaluation, and a framework for mitigation techniques. We define an error framework that distinguishes Knowledge Adherence (KA: staying faithful to retrieved documents) from Behavioral Adherence (BA: following safety, tone, and scope instructions). Our single-turn attacks consist of seven attack vectors reflecting real-world pressures, including advice-seeking, user distress, and prompt injection. A subset of these vectors is evaluated in multi-turn attacks. We evaluate two mitigation strategies: (1) prompt augmentation, which adds explicit guardrails to the chatbot prompt, and (2) document augmentation, which adds a localized FAQ document to the retrieval corpus. Finally, we apply this protocol to a social care chatbot (specifically, one supporting Health-Related Social Needs (HRSN)), developed as an agentic workflow that queries a vetted HRSN resource index. The evaluation corpus comprises 140 single-turn probes and 20 multi-turn stress tests. We assess correctness and risk severity via human annotation.

Results
Our error framework identified that the primary safety risk was a failure to follow behavioral rules, rather than a lack of factual knowledge. Furthermore, multi-turn stress tests revealed critical vulnerabilities that single-turn testing missed, directly informing our choice of targeted mitigations. In single-turn tests, the chatbot was factually robust, yielding 0/60 KA errors; however, it struggled with behavioral instructions, producing a 15% (12/80) BA error rate, with 21% (4/19) of those errors being high-severity. Notable vulnerabilities included advice_query (30% BA, 6/20) and prompt_injection (20% BA, 4/20). User_distress probes triggered hallucination of unverified contact details in 20% (4/20) of cases. In multi-turn stress tests, error rates rose sharply under conversational persistence: advice_query BA errors reached 50% (5/10) and user_distress reached 40% (4/10), accounting for all high-severity errors (4/4). Prompt augmentation reduced total errors across these vectors by 60% (from 15/60 to 6/60). Document augmentation eliminated all single-turn user_distress errors (to 0/20) and reduced advice_query errors (from 7/20 to 4/20). When combined in multi-turn tests, these mitigations eliminated high-severity errors entirely, reducing BA errors to 20% (advice_query) and 30% (user_distress) by forcing the chatbot into safe-failure loops.

Conclusion
We demonstrate that a protocol combining single-turn breadth, multi-turn depth, and layered mitigations materially improves chatbot safety and offers a practical template for patient-facing chatbots. Future work should extend this protocol to chatbots in more diverse clinical domains and to a larger panel of evaluators.
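The error stratification described above (KA vs. BA, with severity grading) lends itself to a simple per-vector tally over human annotations. The sketch below is a minimal illustration of how such rates could be computed; the record structure, label names, and sample batch are hypothetical assumptions, not the paper's actual data or code.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One human-annotated probe result (hypothetical schema)."""
    vector: str       # attack vector, e.g. "advice_query", "prompt_injection"
    error_type: str   # "KA", "BA", or "none"
    severity: str     # "none", "low", or "high"

def error_rate(annotations, error_type):
    """Fraction of annotated probes flagged with the given error type."""
    if not annotations:
        return 0.0
    flagged = sum(1 for a in annotations if a.error_type == error_type)
    return flagged / len(annotations)

# Illustrative batch: 20 advice_query probes with 6 BA errors (a 30% BA rate,
# matching the shape of the single-turn figures reported in the abstract).
batch = (
    [Annotation("advice_query", "BA", "low")] * 5
    + [Annotation("advice_query", "BA", "high")] * 1
    + [Annotation("advice_query", "none", "none")] * 14
)
print(error_rate(batch, "BA"))  # 0.3
```

Reporting rates per attack vector and error type, as above, is what lets the protocol separate factual failures (KA) from behavioral ones (BA) when choosing mitigations.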