Title: STELLA: Safety Testing Engine for Large Language Assistants

Perlis, R. H.; Bin Adil, A.; Dobyns, K.

2025-12-15 health informatics

10.64898/2025.12.11.25342078 medRxiv


Background: Assistants incorporating large language models are increasingly applied in health care, where they represent a promising means of expanding access to care. However, there is growing recognition of the risk that these chatbots may fail to respond appropriately to individuals in crisis, and may adversely affect mental health in some circumstances.

Methods: We developed and implemented an automated system for assessing voice or text AI assistant responses to users across a range of health scenarios. This set of tools incorporates simulated users with a specified set of characteristics; scenarios in which they interact with a chatbot over multiple rounds; and designs that allow multiple cohorts to be compared. Study designs, including simulated randomized trials, can be generated via natural language prompts. Chatbot session transcripts are then quantified in terms of safety, efficacy, and user engagement according to prespecified rubrics and exemplars with an ensemble of judging language models, allowing specific exchanges to be flagged for manual review. To illustrate this approach, we assessed 10 safety scenarios in 11 frontier language model chatbots, including Claude Opus 4.5, ChatGPT-5.2, and Gemini 3, using 5 personas, each followed over 10 exchanges, with a subset assessed for an additional 5 personas.

Results: The total proportion of responses flagged for possible harmful content ranged from 3.2% (95% CI 2.0-5.1%) for GPT 5.2 to 34.0% (95% CI 30.0-38.3%) for Grok-4.1-fast-non-reasoning. The total proportion of responses flagged for failing to provide beneficial content ranged from 19.6% (95% CI 16.4-23.3%) for GPT 5.2 to 66.0% (95% CI 61.7-70.0%) for Grok-4.1-fast-reasoning. In aggregate, the proportion of unsafe content increased across turns; for failure to provide beneficial content, the increase was 0.7% per turn (95% CI 0.3%-1.1%).
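
The abstract does not say how the 95% confidence intervals on the flagged proportions were computed. As a minimal sketch, the Python below computes a Wilson score interval, one common choice for binomial proportions. Two things here are assumptions and do not come from the source: the interval method itself, and the denominator of 500 responses per model (10 scenarios x 5 personas x 10 exchanges). The function name `wilson_interval` is illustrative only.

```python
import math


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = flagged / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Assumed denominator (not stated in the abstract):
# 10 scenarios x 5 personas x 10 exchanges = 500 responses per model.
N_RESPONSES = 10 * 5 * 10

# 16 of 500 responses flagged corresponds to the reported 3.2%.
low, high = wilson_interval(16, N_RESPONSES)
print(f"{low:.1%} to {high:.1%}")  # prints "2.0% to 5.1%"
```

Under these assumptions the output agrees with the reported interval for the lowest harmful-content rate, and the same call with 170, 98, and 330 flagged responses gives 30.0-38.3%, 16.4-23.3%, and 61.7-70.0%. That agreement is suggestive but does not confirm the authors' method, and models assessed with the additional 5 personas would have a larger denominator.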

Conclusion: A simulation-based test harness can facilitate the rapid characterization and comparison of large language model assistant performance according to standardized rubrics. Existing frontier models vary substantially on these metrics. Simulation strategies such as this one may accelerate efforts to ensure that chatbots yield benefit rather than harm to users who seek to apply them to address mental health and well-being.
