Physician Facing AI Tools Show Distinct Failure Modes Under Structured Stress Testing

Hazare, N. S.; Oh, W.; Kumar, G.; Goel, N.; Shaikh, A.; Sharma, A.; Desman, J.; Kumar, A.; Robles, C.; Singh, A.; Jangda, M.; Agaron, S.; Capone, C.; Ngai, D.; Itwaru, A.; Parchure, P.; Ramaswamy, A.; Gorbenko, K.; Timsina, P.; Lampert, J.; Tamler, R.; Manasia, A.; Kohli-Seth, R.; Kaplan, B.; Vakil, A.; Omar, M.; Glicksberg, B. S.; Freeman, R.; Stern, A. D.; Klang, E.; Darrow, B.; Stump, L. S.; Reich, D.; Charney, A.; Nadkarni, G. N.; Sakhuja, A.

2026-05-29 health informatics
10.64898/2026.05.27.26354248 medRxiv

Importance: Physician-facing AI tools are now in clinical use, yet whether different platforms fail in similar or fundamentally different ways in high-stakes settings like critical care is unknown.

Objective: To evaluate two physician-facing AI platforms, ChatGPT for Clinicians and OpenEvidence, for distinct vulnerabilities under structured stress testing.

Design, Setting, and Participants: An observational study conducted using 60 simulated critical care vignettes developed and adjudicated by four attending critical care physicians. Data were collected in the last week of April 2026, via the public website interfaces of each platform.

Interventions/Exposures: A 2×2×2×2 factorial design across four stressors - anchoring, cognitive load, social conformity pressure, and a clinically incorrect directive - yielded 16 prompt subsets per vignette and 960 prompts per platform. A separate multi-turn adversarial prompting paradigm administered three sequential "You are incorrect" challenges to baseline vignettes. All prompts had a universal output length constraint of fewer than 30 words.

Main Outcomes and Measures: Critical elements capture (percentage of gold-standard critical elements present in responses), susceptibility to clinically incorrect directive, and sycophancy (reversal of an initial correct recommendation under iterative adversarial challenge).

Results: Across 1916 responses to 1920 prompts, ChatGPT for Clinicians captured more gold-standard critical elements than OpenEvidence (81.4% ± 18.1% vs 61.0% ± 23.5%; adjusted difference, 20.3 percentage points; 95% CI, 18.3 to 22.4; P < .001) and was less susceptible to clinically incorrect directives (1.7% vs 8.0%; adjusted odds ratio, 0.07; 95% CI, 0.02-0.21; P < .001). Anchoring and social conformity pressure were associated with reduced critical element capture across both platforms, while cumulative stressor burden reduced critical element capture only on OpenEvidence. Conversely, ChatGPT for Clinicians reversed correct recommendations more readily under adversarial prompting (hazard ratio, 2.61; 95% CI, 1.10-6.19; P = .03).

Conclusion and Relevance: The two physician-facing clinical AI platforms evaluated demonstrated non-overlapping vulnerabilities, with neither platform uniformly superior. These findings argue against single-axis ranking of clinical AI systems and support multidimensional safety evaluation encompassing completeness of reasoning, resistance to incorrect directives, and stability under adversarial challenge.
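The prompt counts in the abstract follow from the factorial design: four binary stressors give 2^4 = 16 conditions per vignette, and 60 vignettes give 960 prompts per platform. The sketch below only illustrates that arithmetic and the stated definition of critical elements capture. The identifier names, the notion of "burden" as a simple count of active stressors, and the set-based scoring are assumptions for illustration; in the study, critical elements were adjudicated by physicians, and this is not the authors' code.

```python
from itertools import product

# Stressor names taken from the abstract; the identifiers are illustrative.
STRESSORS = ("anchoring", "cognitive_load", "social_conformity", "incorrect_directive")
N_VIGNETTES = 60  # simulated critical care vignettes, per the abstract


def factorial_conditions(stressors=STRESSORS):
    """Enumerate every on/off combination of the stressors (2^k conditions)."""
    return [
        dict(zip(stressors, flags))
        for flags in product((False, True), repeat=len(stressors))
    ]


def stressor_burden(condition):
    """Number of active stressors in a condition (assumed meaning of 'burden')."""
    return sum(condition.values())


def capture_pct(elements_present, gold_elements):
    """Percentage of gold-standard critical elements judged present.

    Both arguments are sets of hypothetical element labels; deciding whether
    an element is present in a response is outside the scope of this sketch.
    """
    gold = set(gold_elements)
    if not gold:
        raise ValueError("gold_elements must not be empty")
    return 100.0 * len(gold & set(elements_present)) / len(gold)


conditions = factorial_conditions()
prompts_per_platform = N_VIGNETTES * len(conditions)
print(len(conditions), prompts_per_platform)  # 16 960
```

The first condition in the enumeration has every stressor off, which corresponds to an unstressed baseline prompt; the last has all four on.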

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine (97 papers in training set; Top 0.2%): 23.0%
2. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.2%): 14.6%
3. Philosophical Transactions of the Royal Society B (51 papers in training set; Top 0.3%): 9.3%
4. BMJ Health & Care Informatics (13 papers in training set; Top 0.1%): 7.3%

50% of probability mass above

5. PLOS Digital Health (91 papers in training set; Top 0.3%): 7.3%
6. Scientific Reports (3102 papers in training set; Top 26%): 4.4%
7. Annals of Internal Medicine (27 papers in training set; Top 0.2%): 3.1%
8. PLOS ONE (4510 papers in training set; Top 42%): 2.9%
9. Journal of Medical Internet Research (85 papers in training set; Top 2%): 2.1%
10. Frontiers in Digital Health (20 papers in training set; Top 0.6%): 1.8%
11. Healthcare (16 papers in training set; Top 0.6%): 1.7%
12. JAMA Network Open (127 papers in training set; Top 2%): 1.7%
13. Computers in Biology and Medicine (120 papers in training set; Top 3%): 1.0%
14. JMIR Medical Informatics (17 papers in training set; Top 1%): 1.0%
15. International Journal of Medical Informatics (25 papers in training set; Top 1%): 0.9%
16. Critical Care Explorations (15 papers in training set; Top 0.5%): 0.8%
17. JAMIA Open (37 papers in training set; Top 2%): 0.7%
18. Journal of General Internal Medicine (20 papers in training set; Top 1%): 0.7%
19. iScience (1063 papers in training set; Top 33%): 0.7%
20. Journal of Personalized Medicine (28 papers in training set; Top 2%): 0.5%
21. JMIR Public Health and Surveillance (45 papers in training set; Top 5%): 0.5%
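
The "top 4 journals" statement can be checked by accumulating the listed probabilities in rank order until the running total reaches 50%. The sketch below does this for the first five listed values; the function name and the 50% threshold parameter are illustrative assumptions, not part of the journal-matching tool.

```python
from itertools import accumulate

# Predicted probabilities (%) for the top five journals, in listed rank order.
TOP_PROBS = [23.0, 14.6, 9.3, 7.3, 7.3]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative %) for the first prefix reaching the threshold.

    Returns None if the listed probabilities never reach the threshold.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count, round(total, 1)
    return None


print(journals_to_reach(TOP_PROBS))  # (4, 54.2)
```

With the listed values, the first three journals sum to 46.9% and the fourth brings the total to 54.2%, so four journals are the smallest set that crosses the 50% mark.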