
Benchmarking Language Models for Clinical Safety: A Primer for Mental Health Professionals

Flathers, M.; Nguyen, P. A. H.; Herpertz, J.; Granof, M.; Ryan, S. J.; Wentworth, L.; Moutier, C. Y.; Torous, J.

2026-03-23 · psychiatry and clinical psychology
medRxiv · DOI: 10.64898/2026.03.20.26348900

Background: Millions of people use language models to discuss mental health concerns, including suicidal ideation, but limited frameworks exist for evaluating whether these systems respond safely. Benchmarking, the practice of administering standardized assessments to language models, offers direct parallels to clinical competency evaluation, yet few clinicians are involved in designing, validating, or interpreting these assessments.

Aims: To introduce mental health professionals to benchmarking language models by administering a validated clinical instrument and demonstrating how configuration decisions, measurement limitations, and scoring context affect result interpretation.

Method: We administered the Suicide Intervention Response Inventory (SIRI-2) programmatically to nine commercially available language models from three providers. Each item was presented 60 times per model (three prompt variants × two temperature settings × 10 repetitions), yielding 27,000 model responses compared against point-in-time expert consensus.

Results: Total scores ranged from 19.5 to 84.0 (expert panel baseline: 32.5). Prompt design alone shifted individual model scores by as much as the difference between trained and untrained human groups. The best-performing model approached the instrument's measurement floor. All nine models consistently overrated clinically inappropriate responses that sounded supportive.

Conclusions: A single benchmark score can support markedly different claims depending on the assumed standard of clinical behavior, the instrument's remaining measurement range, and the configuration that produced the result. The skills required to make these distinctions must become core competencies. Benchmark results are increasingly used to support claims about mental health safety that may not be accurate, making it necessary to close the gap between clinical measurement and AI.
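The Method section describes a fully crossed design (prompt variant × temperature × repetition, per item, per model). A minimal sketch of that design and its response counts, where the prompt labels and temperature values are illustrative assumptions, and the per-model item count (50 scored responses) is back-derived from the reported 27,000 total rather than taken from the SIRI-2 itself:

```python
from itertools import product

# Illustrative labels and values -- assumptions, not the study's actual settings.
PROMPT_VARIANTS = ["variant_a", "variant_b", "variant_c"]  # three prompt variants
TEMPERATURES = [0.0, 1.0]                                  # two temperature settings
REPETITIONS = 10
MODELS = [f"model_{i}" for i in range(9)]                  # nine models
N_ITEMS = 27_000 // (len(MODELS) * 60)                     # back-derived: 50

# Each item is presented once per (prompt, temperature, repetition) combination.
conditions = list(product(PROMPT_VARIANTS, TEMPERATURES, range(REPETITIONS)))

presentations_per_item = len(conditions)                   # 3 x 2 x 10 = 60
total_responses = presentations_per_item * N_ITEMS * len(MODELS)

print(presentations_per_item)  # 60
print(total_responses)         # 27000
```

This kind of accounting is worth making explicit when reading any benchmark: the 27,000 figure is a product of configuration choices, and each factor (prompt wording, temperature) is a dimension along which scores can shift.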
Plain Language Summary: AI chatbots like ChatGPT, Claude, and Gemini are increasingly used by millions of people to discuss mental health problems, including thoughts of suicide. To assess whether these systems handle such conversations safely, researchers give them standardized tests called benchmarks and compare their answers to those of human experts. These scores are already used to argue AI systems are ready for clinical use. This study gave a well-established test of suicide response skills to nine AI models from three major companies under varying conditions. We changed how much instruction the AI received and how much randomness was built into its responses, then measured whether the scores changed. The same AI model could score like a trained crisis counselor under one set of conditions and like an untrained undergraduate under another, depending on choices the person running the test made. Every model also made the same kind of mistake: responses that sounded warm and caring were rated as appropriate, even when experts had judged them to be clinically problematic. The highest-scoring model performed so well that the test could no longer measure whether it was truly skilled or had simply exceeded the test's range. These findings show that a single score can be misleading without knowing how the test was run, whether it can still distinguish strong from weak performance, and whether it matches what the AI is used for. Mental health professionals routinely make these judgments about clinical assessments and are well positioned to bring that expertise to AI evaluation.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|---|---|---|---|---|
| 1 | PLOS ONE | 4510 | Top 13% | 14.6% |
| 2 | Journal of Medical Internet Research | 85 | Top 0.3% | 10.6% |
| 3 | Frontiers in Psychiatry | 83 | Top 0.3% | 8.5% |
| 4 | Frontiers in Digital Health | 20 | Top 0.1% | 6.4% |
| 5 | Acta Psychiatrica Scandinavica | 10 | Top 0.1% | 4.9% |
| 6 | Psychiatry Research | 35 | Top 0.3% | 4.9% |
| 7 | Journal of General Internal Medicine | 20 | Top 0.2% | 4.2% |
| 8 | JMIR Formative Research | 32 | Top 0.3% | 3.6% |
| 9 | npj Digital Medicine | 97 | Top 2% | 2.6% |
| 10 | BJPsych Open | 25 | Top 0.2% | 2.4% |
| 11 | JMIRx Med | 31 | Top 0.5% | 1.9% |
| 12 | BMJ Open | 554 | Top 9% | 1.8% |
| 13 | Computational Psychiatry | 12 | Top 0.1% | 1.7% |
| 14 | Journal of Psychiatric Research | 28 | Top 0.4% | 1.7% |
| 15 | JAMA Network Open | 127 | Top 2% | 1.7% |
| 16 | European Psychiatry | 10 | Top 0.4% | 1.5% |
| 17 | Frontiers in Artificial Intelligence | 18 | Top 0.4% | 1.4% |
| 18 | Frontiers in Public Health | 140 | Top 6% | 1.4% |
| 19 | Scientific Reports | 3102 | Top 66% | 1.2% |
| 20 | BMC Health Services Research | 42 | Top 2% | 1.1% |
| 21 | Journal of Affective Disorders Reports | 10 | Top 0.2% | 0.9% |
| 22 | Nature Medicine | 117 | Top 4% | 0.9% |
| 23 | JAMIA Open | 37 | Top 1% | 0.8% |
| 24 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 0.8% |
| 25 | BioData Mining | 15 | Top 0.9% | 0.8% |
| 26 | Journal of Affective Disorders | 81 | Top 2% | 0.8% |
| 27 | Acta Neuropsychiatrica | 12 | Top 0.9% | 0.8% |
| 28 | Translational Psychiatry | 219 | Top 4% | 0.7% |
| 29 | Bioengineering | 24 | Top 2% | 0.5% |
| 30 | Communications Psychology | 20 | Top 0.4% | 0.5% |