
Title: STELLA: Safety Testing Engine for Large Language Assistants

Perlis, R. H.; Bin Adil, A.; Dobyns, K.

2025-12-15 | health informatics
DOI: 10.64898/2025.12.11.25342078 | medRxiv

Background: Assistants incorporating large language models are increasingly applied in health care, where they represent a promising means of expanding access to care. However, there is growing recognition of the risk that these chatbots may fail to respond appropriately to individuals in crisis, and may adversely affect mental health in some circumstances.

Methods: We developed and implemented an automated system for assessing how voice or text AI assistants respond to users across a range of health scenarios. This set of tools incorporates simulated users with a specified set of characteristics; scenarios in which they interact with a chatbot over multiple rounds; and designs that allow multiple cohorts to be compared. Study designs, including simulated randomized trials, can be generated via natural language prompts. Chatbot session transcripts are then quantified in terms of safety, efficacy, and user engagement according to prespecified rubrics and exemplars by an ensemble of judging language models, allowing specific exchanges to be flagged for manual review. To illustrate this approach, we assessed 10 safety scenarios in 11 frontier language model chatbots, including Claude Opus 4.5, ChatGPT-5.2, and Gemini 3, using 5 personas, each followed over 10 exchanges, with a subset assessed for an additional 5 personas.

Results: The total proportion of responses flagged for possibly harmful content ranged from 3.2% (95% CI 2.0-5.1%) for GPT 5.2 to 34.0% (95% CI 30.0-38.3%) for Grok-4.1-fast-non-reasoning. The total proportion of responses flagged for failing to provide beneficial content ranged from 19.6% (95% CI 16.4-23.3%) for GPT 5.2 to 66.0% (95% CI 61.7-70.0%) for Grok-4.1-fast-reasoning. In aggregate, the proportion of unsafe content increased across turns; for failure to provide beneficial content, by 0.7% per turn (95% CI 0.3%-1.1%).

Conclusion: A simulation-based test harness can facilitate the rapid characterization and comparison of large language model assistant performance according to standardized rubrics. Existing frontier models vary substantially on these metrics. Simulation strategies such as this one may accelerate efforts to ensure that chatbots benefit rather than harm users who seek to apply them to address mental health and well-being.
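The multi-turn simulation and ensemble-judging loop described in the Methods could be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: every name here (`chatbot_reply`, `judge_vote`, the rubric string, the majority-vote threshold) is a hypothetical stand-in, and the stubbed functions replace real model API calls and rubric-based judge prompts.

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    """A simulated user with a specified set of characteristics."""
    name: str
    traits: str

def chatbot_reply(history):
    # Stand-in for a real chatbot API call (assumption: any chat
    # endpoint that accepts a message history would slot in here).
    return f"assistant reply #{len(history)}"

def judge_vote(turn_text, rubric):
    # Stand-in for one judging language model. A real judge would be
    # prompted with the rubric and exemplars; here we flag at random.
    return random.random() < 0.1

def ensemble_flag(turn_text, rubric, n_judges=3):
    # Majority vote across an ensemble of judging models.
    votes = sum(judge_vote(turn_text, rubric) for _ in range(n_judges))
    return votes > n_judges // 2

def run_scenario(persona, user_turns, rubric="possibly harmful content"):
    """Run one persona through a multi-turn scenario; return the
    indices of exchanges flagged for manual review."""
    history, flagged = [], []
    for t, user_msg in enumerate(user_turns):
        history.append(("user", user_msg))
        reply = chatbot_reply(history)
        history.append(("assistant", reply))
        if ensemble_flag(reply, rubric):
            flagged.append(t)
    return flagged

random.seed(0)
persona = Persona("P1", "user in crisis, reluctant to seek help")
flags = run_scenario(persona, [f"user message {i}" for i in range(10)])
print("flagged exchanges:", flags)
```

In a full harness, the per-turn flags from many personas and scenarios would be aggregated into the proportions (with confidence intervals) reported in the Results.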

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank  Journal                                                   Papers in training set  Top %     Probability
1     Frontiers in Digital Health                               20                      Top 0.1%  18.6%
2     Journal of Medical Internet Research                      85                      Top 0.4%  10.1%
3     Journal of the American Medical Informatics Association   61                      Top 0.5%  6.4%
4     PLOS ONE                                                  4510                    Top 28%   6.3%
5     JMIR Formative Research                                   32                      Top 0.3%  4.0%
6     Philosophical Transactions of the Royal Society B         51                      Top 1%    3.6%
7     Healthcare                                                16                      Top 0.2%  2.7%
----- 50% of probability mass above this line -----
8     BMJ Open                                                  554                     Top 8%    2.1%
9     DIGITAL HEALTH                                            12                      Top 0.3%  2.1%
10    Scientific Reports                                        3102                    Top 50%   2.1%
11    BMJ Health & Care Informatics                             13                      Top 0.3%  1.9%
12    Journal of General Internal Medicine                      20                      Top 0.4%  1.9%
13    JMIR Public Health and Surveillance                       45                      Top 2%    1.8%
14    JAMIA Open                                                37                      Top 0.8%  1.7%
15    PLOS Digital Health                                       91                      Top 1%    1.7%
16    npj Digital Medicine                                      97                      Top 2%    1.7%
17    Biology Methods and Protocols                             53                      Top 1%    1.3%
18    Frontiers in Psychiatry                                   83                      Top 2%    1.2%
19    Journal of NeuroEngineering and Rehabilitation            28                      Top 0.7%  1.2%
20    BJPsych Open                                              25                      Top 0.6%  0.9%
21    BMC Bioinformatics                                        383                     Top 6%    0.9%
22    JAMA Pediatrics                                           10                      Top 0.1%  0.9%
23    BMC Medical Research Methodology                          43                      Top 1%    0.9%
24    BMC Research Notes                                        29                      Top 0.4%  0.9%
25    International Journal of Medical Informatics              25                      Top 1%    0.9%
26    Journal of Biomedical Informatics                         45                      Top 1%    0.9%
27    JMIRx Med                                                 31                      Top 1%    0.9%
28    JMIR mHealth and uHealth                                  10                      Top 0.4%  0.8%
29    Psychiatry Research                                       35                      Top 1%    0.7%
30    JMIR Research Protocols                                   18                      Top 2%    0.7%