
Benchmarking Large Language Models for Intensive Care Unit Clinical Decision Support: A Dual Safety Evaluation of 26 Models on Consumer Hardware

Shlyakhta, T.

2026-02-10 health informatics
10.64898/2026.02.08.26345854 medRxiv

Background: Large Language Models (LLMs) show promise for clinical decision support in Intensive Care Units (ICUs), but their safety and reliability have not been adequately evaluated through dual testing of both memory-dependent and memory-independent safety mechanisms.
Objective: To evaluate LLMs using two independent safety tests, context-dependent contraindication memory (penicillin allergy recall) and context-independent authority resistance (Extended Milgram Test), and to determine whether these represent unified or dissociated safety mechanisms.
Methods: Twenty-three LLMs underwent automated testing via a 24-hour ICU simulation on consumer hardware (NVIDIA RTX 3060 12GB). A subset of 26 models completed an Extended Milgram Test with five escalating harmful command scenarios. Scoring assessed safety compliance, Milgram resistance, conflict detection, and performance.
Results: Critical findings revealed a dissociation between abstract ethics and clinical memory. While 65% of models achieved perfect Milgram resistance (100%), only 8.7% (n=2) correctly refused penicillin when an allergy had been mentioned. Eight models demonstrated 100% Milgram resistance yet failed allergy recall (r = -0.39, p = 0.23). Only Granite 3.1 8B achieved perfect performance on both tests.
Conclusions: Abstract ethical reasoning (refusing harmful orders in principle) is independent of concrete clinical memory (tracking patient-specific risks). Safe medical AI requires both capabilities, which are rarely present together. Dual safety testing should become mandatory for medical AI certification.
Highlights
- Only 8.7% of tested LLMs passed critical safety tests for medication prescribing
- First study demonstrating dissociation between abstract ethics and clinical memory (r = -0.39)
- Eight models refused all harmful orders but forgot documented allergies
- Granite 3.1 8B only model achieving perfect performance on both safety tests
- Dual safety testing framework proposed for medical AI certification
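
The dual criterion described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the study's scoring code: the `ModelResult` record, the `passes_dual_safety` helper, and the three `model_*` entries are hypothetical and carry no real results. Only the 2-of-23 allergy figure comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Hypothetical per-model record; field names are assumptions, not the study's schema."""
    name: str
    milgram_resistance: float  # fraction of the five harmful commands refused, 0.0 to 1.0
    refused_penicillin: bool   # True if the model refused penicillin after the allergy mention


def passes_dual_safety(result: ModelResult) -> bool:
    """A model passes only if it succeeds on BOTH tests."""
    return result.milgram_resistance == 1.0 and result.refused_penicillin


def percent(count: int, total: int) -> float:
    """Share of `total` expressed as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Placeholder records for illustration only; these are NOT results from the study.
results = [
    ModelResult("model_a", 1.0, True),   # passes both tests
    ModelResult("model_b", 1.0, False),  # resists harmful orders but misses the allergy
    ModelResult("model_c", 0.6, False),  # fails both tests
]

dual_passers = [r.name for r in results if passes_dual_safety(r)]

# The abstract reports n=2 allergy-test passes among 23 models.
allergy_pass_rate = percent(2, 23)
```

A record like `model_b` is the case the abstract singles out: perfect authority resistance does not imply that the patient-specific contraindication was retained.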

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.4% | 8.3%
2 | PLOS ONE | 4510 papers in training set | Top 22% | 8.3%
3 | BMJ Health & Care Informatics | 13 papers in training set | Top 0.1% | 8.3%
4 | PLOS Digital Health | 91 papers in training set | Top 0.3% | 7.1%
5 | npj Digital Medicine | 97 papers in training set | Top 0.8% | 6.3%
6 | Scientific Reports | 3102 papers in training set | Top 25% | 4.8%
7 | Frontiers in Digital Health | 20 papers in training set | Top 0.3% | 3.6%
8 | BMJ Open | 554 papers in training set | Top 6% | 3.6%
50% of probability mass above
9 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.8% | 3.6%
10 | International Journal of Medical Informatics | 25 papers in training set | Top 0.4% | 3.6%
11 | Journal of Medical Internet Research | 85 papers in training set | Top 1% | 3.6%
12 | JMIR Medical Informatics | 17 papers in training set | Top 0.5% | 2.4%
13 | JAMIA Open | 37 papers in training set | Top 0.6% | 2.3%
14 | Healthcare | 16 papers in training set | Top 0.4% | 2.1%
15 | Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.2% | 1.9%
16 | BMC Bioinformatics | 383 papers in training set | Top 5% | 1.6%
17 | Frontiers in Public Health | 140 papers in training set | Top 5% | 1.6%
18 | JMIR Formative Research | 32 papers in training set | Top 0.9% | 1.6%
19 | JMIR Public Health and Surveillance | 45 papers in training set | Top 2% | 1.5%
20 | Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 4% | 1.3%
21 | Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.2%
22 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.6% | 1.2%
23 | Bioinformatics | 1061 papers in training set | Top 8% | 0.9%
24 | Biology Methods and Protocols | 53 papers in training set | Top 2% | 0.9%
25 | Journal of Personalized Medicine | 28 papers in training set | Top 1% | 0.7%
26 | BMC Medical Research Methodology | 43 papers in training set | Top 2% | 0.7%
27 | Bioengineering | 24 papers in training set | Top 2% | 0.6%
28 | Journal of NeuroEngineering and Rehabilitation | 28 papers in training set | Top 1% | 0.6%
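
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked from the displayed percentages. The sketch below assumes the rounded values in the list are accurate enough for this purpose. The helper name `journals_to_reach` is hypothetical and is not part of the journal-matching tool.

```python
from itertools import accumulate

# Displayed (rounded) probabilities in percent, in rank order, for the 28 listed journals.
probabilities = [
    8.3, 8.3, 8.3, 7.1, 6.3, 4.8, 3.6, 3.6,   # ranks 1-8
    3.6, 3.6, 3.6, 2.4, 2.3, 2.1, 1.9, 1.6,   # ranks 9-16
    1.6, 1.6, 1.5, 1.3, 1.2, 1.2, 0.9, 0.9,   # ranks 17-24
    0.7, 0.7, 0.6, 0.6,                       # ranks 25-28
]


def journals_to_reach(probs, threshold):
    """Return the smallest k such that the top-k probabilities sum to at least `threshold`.

    Returns None if the threshold is never reached.
    """
    for k, running_total in enumerate(accumulate(probs), start=1):
        if running_total >= threshold:
            return k
    return None


top_k = journals_to_reach(probabilities, 50.0)
listed_mass = sum(probabilities)  # mass covered by the 28 listed journals
```

With these values `top_k` comes out as 8, matching the marker in the list. The 28 listed journals together cover roughly 84% of the mass, so the remainder is presumably spread over journals not shown.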