Back

Red-Teaming Medical AI: Systematic Adversarial Evaluation of LLM Safety Guardrails in Clinical Contexts

Ekram, T. T.

2026-03-05 health informatics
10.64898/2026.02.26.26347212 medRxiv
Show abstract

BackgroundLarge language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death. ObjectiveTo systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designed for medical contexts. MethodsWe developed a comprehensive taxonomy of 8 adversarial attack categories targeting medical AI safety, encompassing 24 distinct sub-strategies. Using an LLM-based attack generator, we created 160 realistic adversarial prompts across categories including dangerous dosing, contraindication bypass, emergency misdirection, and multi-turn escalation. We tested multiple leading LLMs (Claude Sonnet 4.5, GPT-5.2, Gemini 2.5 Pro, Gemini 3 Flash) using both single-turn and multi-turn attack sequences. All models received identical, standard medical assistant system prompts. An automated evaluator (Claude Sonnet 4.5) pre-screened responses for harm potential (0-5 scale) and guardrail effectiveness, with physician review planned for high-risk responses (harm_level [≥] 3). ResultsOf 160 adversarial prompts evaluated against Claude Sonnet 4.5, 11 (6.9%) elicited responses meeting our threshold for clinically significant harm (harm level [≥] 3 on a 0-5 scale). The model exhibited full refusal behavior in 86.2% of cases. Authority Impersonation was the dominant attack vector (45.0% success rate),s with the "Educational Authority" sub-strategy (framing requests as medical student questions) achieving 83.3% success -- the highest of any sub-strategy. Multi-turn escalation attacks achieved 0% success (0/20). Six of eight attack categories yielded no successful attacks. Physician review of the 11 flagged high-harm cases is in progress. ConclusionsStandard medical assistant system prompts provide strong baseline protection against most adversarial attacks, but are substantially vulnerable to authority impersonation -- particularly claims of educational context. The primary failure mode is behavioral mode-switching: the model provides clinically accurate but safety-framed-inadequately responses when it perceives a professional audience, rather than providing factually incorrect information. This suggests that guardrail improvements should target context-conditioned behavior rather than factual accuracy alone. Our open-source taxonomy and evaluation pipeline enable ongoing adversarial assessment as medical AI systems evolve. ImpactThis work provides the first systematic taxonomy and evaluation framework for medical AI adversarial testing, enabling developers to identify and remediate safety gaps before deployment. Our open-source attack taxonomy and methodology can serve as a foundation for ongoing red-teaming efforts as medical AI systems continue to evolve.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1
npj Digital Medicine
97 papers in training set
Top 0.2%
23.4%
2
Journal of the American Medical Informatics Association
61 papers in training set
Top 0.2%
10.9%
3
BMJ Health & Care Informatics
13 papers in training set
Top 0.1%
7.5%
4
PLOS ONE
4510 papers in training set
Top 30%
5.0%
5
Scientific Reports
3102 papers in training set
Top 32%
3.8%
50% of probability mass above
6
JCO Clinical Cancer Informatics
18 papers in training set
Top 0.2%
3.7%
7
PLOS Digital Health
91 papers in training set
Top 0.6%
3.7%
8
Philosophical Transactions of the Royal Society B
51 papers in training set
Top 2%
3.2%
9
Frontiers in Digital Health
20 papers in training set
Top 0.3%
2.7%
10
Computers in Biology and Medicine
120 papers in training set
Top 1%
2.2%
11
JAMIA Open
37 papers in training set
Top 0.7%
2.0%
12
Annals of Internal Medicine
27 papers in training set
Top 0.3%
2.0%
13
International Journal of Medical Informatics
25 papers in training set
Top 0.8%
1.8%
14
Journal of Medical Internet Research
85 papers in training set
Top 3%
1.5%
15
Journal of NeuroEngineering and Rehabilitation
28 papers in training set
Top 0.6%
1.4%
16
Biology Methods and Protocols
53 papers in training set
Top 1%
1.3%
17
iScience
1063 papers in training set
Top 23%
1.2%
18
Artificial Intelligence in Medicine
15 papers in training set
Top 0.5%
1.2%
19
Healthcare
16 papers in training set
Top 1%
1.0%
20
BMJ Open
554 papers in training set
Top 11%
1.0%
21
BMC Medical Informatics and Decision Making
39 papers in training set
Top 2%
0.9%
22
Frontiers in Artificial Intelligence
18 papers in training set
Top 0.5%
0.9%
23
Journal of Biomedical Informatics
45 papers in training set
Top 1%
0.8%
24
JMIR Medical Informatics
17 papers in training set
Top 1%
0.8%
25
JMIR Public Health and Surveillance
45 papers in training set
Top 3%
0.8%
26
Nature Medicine
117 papers in training set
Top 4%
0.8%
27
Patterns
70 papers in training set
Top 3%
0.7%
28
Computer Methods and Programs in Biomedicine
27 papers in training set
Top 1%
0.5%
29
BMC Bioinformatics
383 papers in training set
Top 8%
0.5%
30
Bioinformatics
1061 papers in training set
Top 11%
0.5%