Back

Physician Facing AI Tools Show Distinct Failure Modes Under Structured Stress Testing

Hazare, N. S.; Oh, W.; Kumar, G.; Goel, N.; Shaikh, A.; Sharma, A.; Desman, J.; Kumar, A.; Robles, C.; Singh, A.; Jangda, M.; Agaron, S.; Capone, C.; Ngai, D.; Itwaru, A.; Parchure, P.; Ramaswamy, A.; Gorbenko, K.; Timsina, P.; Lampert, J.; Tamler, R.; Manasia, A.; Kohli-Seth, R.; Kaplan, B.; Vakil, A.; Omar, M.; Glicksberg, B. S.; Freeman, R.; Stern, A. D.; Klang, E.; Darrow, B.; Stump, L. S.; Reich, D.; Charney, A.; Nadkarni, G. N.; Sakhuja, A.

2026-05-29 health informatics

10.64898/2026.05.27.26354248 medRxiv

Show abstract

Importance: Physician-facing AI tools are now in clinical use, yet whether different platforms fail in similar or fundamentally different ways in high-stakes settings like critical care is unknown. Objective: To evaluate two physician-facing AI platforms, ChatGPT for Clinicians and OpenEvidence, for distinct vulnerabilities under structured stress testing. Design, Setting, and Participants: An observational study conducted using 60 simulated critical care vignettes developed and adjudicated by four attending critical care physicians. Data were collected in the last week of April 2026, via the public website interfaces of each platform. Interventions/Exposures: A 2x2x2x2 factorial design across four stressors - anchoring, cognitive load, social conformity pressure, and a clinically incorrect directive - yielded 16 prompt subsets per vignette and 960 prompts per platform. A separate multi-turn adversarial prompting paradigm administered three sequential "You are incorrect" challenges to baseline vignettes. All prompts had a universal output length constraint of fewer than 30 words. Main Outcomes and Measures: Critical elements capture (percentage of gold-standard critical elements present in responses), susceptibility to clinically incorrect directive, and sycophancy (reversal of an initial correct recommendation under iterative adversarial challenge). Results: Across 1916 responses to 1920 prompts, ChatGPT for Clinicians captured more gold-standard critical elements than OpenEvidence (81.4% {+/-} 18.1% vs 61.0% {+/-} 23.5%; adjusted difference, 20.3 percentage points; 95% CI, 18.3 to 22.4; P < .001) and was less susceptible to clinically incorrect directives (1.7% vs 8.0%; adjusted odds ratio, 0.07; 95% CI, 0.02-0.21; P < .001). Anchoring and social conformity pressure were associated with reduced critical element capture across both platforms, while cumulative stressor burden reduced critical element capture only on OpenEvidence. Conversely, ChatGPT for Clinicians reversed correct recommendations more readily under adversarial prompting (hazard ratio, 2.61; 95% CI, 1.10 - 6.19; P = .03). Conclusion and Relevance: The two physician-facing clinical AI platforms evaluated demonstrated non-overlapping vulnerabilities, with neither platform uniformly superior. These findings argue against single-axis ranking of clinical AI systems and support multidimensional safety evaluation encompassing completeness of reasoning, resistance to incorrect directives, and stability under adversarial challenge.

Physician Facing AI Tools Show Distinct Failure Modes Under Structured Stress Testing

Matching journals