Toward Trustworthy Chatbots: A Protocol for Red Teaming for Health Related Conversations

Hussain, S.-A.; Jackson, D. I.; Lewis, A.; Fosler-Lussier, E.; Sezgin, E.

2025-12-16 health informatics
10.64898/2025.12.15.25342297 medRxiv
Introduction: Health-related chatbots are increasingly used to mediate conversations that carry clinical significance and emotional weight. Retrieval-augmented generation (RAG) can reduce factual errors ("hallucinations"), but risks remain, and further challenges arise when chatbots violate behavioral safety and scope rules. Red teaming, an adversarial testing process that deliberately probes systems for failures before deployment, offers a way to surface potential risks. We describe a task-informed red-teaming protocol for health-related, patient-facing chatbots.

Methods: Our protocol comprises an error stratification, single- and multi-turn attack evaluation, and a framework for mitigation techniques. We define an error framework that distinguishes Knowledge Adherence (KA: staying faithful to retrieved documents) from Behavioral Adherence (BA: following safety, tone, and scope instructions). Our single-turn attacks consist of seven attack vectors reflecting real-world pressures, including advice-seeking, user distress, and prompt injection. A subset of these vectors is evaluated in multi-turn attacks. We evaluate two mitigation strategies: (1) prompt augmentation, which adds explicit guardrails to the chatbot prompt, and (2) document augmentation, which adds a localized FAQ document to the retrieval corpus. Finally, we apply this protocol to a social care chatbot supporting Health-Related Social Needs (HRSN), developed as an agentic workflow that queries a vetted HRSN resource index. The evaluation corpus comprises 140 single-turn probes and 20 multi-turn stress tests. We assess correctness and risk severity via human annotation.

Results: Our error framework showed that the primary safety risk was a failure to follow behavioral rules rather than a lack of factual knowledge. Furthermore, multi-turn stress tests revealed critical vulnerabilities that single-turn testing missed, directly informing our choice of targeted mitigations. In single-turn tests, the chatbot was factually robust, yielding 0/60 KA errors; however, it struggled with behavioral instructions, producing a 15% (12/80) BA error rate, with 21% (4/19) of those being high-severity. Notable vulnerabilities included advice_query (BA 30%, 6/20) and prompt_injection (BA 20%, 4/20). User_distress triggered the hallucination of unverified contact details in 20% (4/20) of cases. In multi-turn stress tests, error rates rose sharply under conversational persistence: advice_query BA errors reached 50% (5/10) and user_distress reached 40% (4/10), accounting for all high-severity errors (4/4). Prompt augmentation reduced total errors across these vectors by 60% (15/60 → 6/60). Document augmentation eliminated all single-turn user_distress errors (to 0/20) and reduced advice_query errors (7/20 → 4/20). When combined in multi-turn tests, these mitigations eliminated high-severity errors entirely, reducing BA errors to 20% (advice_query) and 30% (user_distress) by forcing the chatbot into "safe failure" loops.

Conclusion: We demonstrate that a protocol combining single-turn breadth, multi-turn depth, and layered mitigations materially improves chatbot safety and offers a practical template for patient-facing chatbots. Future work should extend this protocol to chatbots in more diverse clinical domains and to a larger panel of evaluators.
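The per-vector error rates and the mitigation effect reported in the abstract are simple ratios over annotated responses. The sketch below shows one way such a tally could be computed. It is not the authors' tooling: the `Annotation` record, `error_rates`, `relative_reduction`, and `toy_records` are hypothetical names introduced here for illustration, and the records are placeholders built only to match counts quoted in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One human-annotated chatbot response (hypothetical record layout)."""
    vector: str    # attack vector, e.g. "advice_query"
    category: str  # "KA" (knowledge adherence) or "BA" (behavioral adherence)
    is_error: bool


def error_rates(annotations):
    """Map (vector, category) to (errors, total, errors / total)."""
    counts = defaultdict(lambda: [0, 0])  # [errors, total]
    for a in annotations:
        tally = counts[(a.vector, a.category)]
        tally[0] += int(a.is_error)
        tally[1] += 1
    return {key: (e, n, e / n) for key, (e, n) in counts.items()}


def relative_reduction(errors_before, errors_after):
    """Fraction of errors removed by a mitigation (same probe count assumed)."""
    return (errors_before - errors_after) / errors_before


def toy_records(vector, category, errors, total):
    """Build placeholder records that match a reported errors/total count."""
    return [Annotation(vector, category, i < errors) for i in range(total)]


# Placeholder data matching two single-turn BA counts from the abstract.
records = (
    toy_records("advice_query", "BA", 6, 20)
    + toy_records("prompt_injection", "BA", 4, 20)
)
rates = error_rates(records)
for (vector, category), (errors, total, rate) in rates.items():
    print(f"{vector} {category}: {errors}/{total} = {rate:.0%}")

# Prompt augmentation: 15/60 errors before, 6/60 after.
print(f"relative reduction: {relative_reduction(15, 6):.0%}")
```

Keeping KA and BA as separate keys mirrors the abstract's error stratification, so a vector can be factually sound while still failing behavioral rules.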

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Journal of Medical Internet Research | 85 papers in training set | Top 0.1% | 18.2%
2. npj Digital Medicine | 97 papers in training set | Top 0.3% | 18.2%
3. Frontiers in Digital Health | 20 papers in training set | Top 0.1% | 14.0%
50% of probability mass above
4. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.3% | 9.9%
5. PLOS Digital Health | 91 papers in training set | Top 0.6% | 3.9%
6. Scientific Reports | 3102 papers in training set | Top 39% | 3.5%
7. Journal of Biomedical Informatics | 45 papers in training set | Top 0.5% | 2.8%
8. PLOS ONE | 4510 papers in training set | Top 45% | 2.5%
9. Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 2% | 2.5%
10. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 1% | 1.5%
11. Bioinformatics | 1061 papers in training set | Top 8% | 1.5%
12. JAMIA Open | 37 papers in training set | Top 1% | 1.3%
13. International Journal of Medical Informatics | 25 papers in training set | Top 1% | 1.3%
14. JMIR Formative Research | 32 papers in training set | Top 1% | 1.2%
15. Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.1%
16. Healthcare | 16 papers in training set | Top 1% | 0.9%
17. iScience | 1063 papers in training set | Top 27% | 0.9%
18. BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.7%
19. Nature Communications | 4913 papers in training set | Top 64% | 0.7%
20. BMJ Health & Care Informatics | 13 papers in training set | Top 1% | 0.7%
21. DIGITAL HEALTH | 12 papers in training set | Top 0.7% | 0.7%
22. Cureus | 67 papers in training set | Top 6% | 0.6%
23. JCO Clinical Cancer Informatics | 18 papers in training set | Top 1% | 0.6%
24. JMIR mHealth and uHealth | 10 papers in training set | Top 0.5% | 0.6%
25. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 1% | 0.6%
26. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.9% | 0.6%
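
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The following is a minimal sketch of that arithmetic; `journals_covering` is a hypothetical helper name, and the input values are the first five percentages copied from the list above.

```python
def journals_covering(probabilities, mass=50.0):
    """Return (k, cumulative) for the smallest top-k whose sum reaches `mass`.

    `probabilities` are percentages in rank order. Returns None if the
    listed values never reach the requested mass.
    """
    cumulative = 0.0
    for k, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= mass:
            return k, cumulative
    return None


# Predicted percentages for ranks 1 to 5, copied from the list above.
top_five = [18.2, 18.2, 14.0, 9.9, 3.9]
k, cumulative = journals_covering(top_five)
print(f"top {k} journals cover {cumulative:.1f}% of the probability mass")
```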