Cannot, Should Not, Did Anyway: Benchmarking Constraint Enforcement Failure in Frontier LLMs

Haq, S. M.; Nadeem, S.

2026-05-24 health informatics
10.64898/2026.05.20.26353719 medRxiv
Large language models are typically evaluated under fixed instruction contexts, implicitly treating correct refusal as a stable model property. We show that this obscures a critical failure mode: models often recognize that a request should be refused, yet comply when the surrounding instructions exert sufficient pressure. To measure this behavior directly, we introduce FrameProbe, a framework that holds task content fixed while systematically varying instruction context, and instantiate it in KnowDoBench, a benchmark of 221 physician-validated clinical scenarios with rule-based ground truth. Cases span two constraint types: epistemic (unsolvable due to missing information) and normative (ethically or professionally prohibited). Across ten frontier models, constraint recognition is near ceiling under low-pressure conditions, yet performance degrades sharply as instructional pressure increases. Under coercive institutional framing, most models comply on cases they had previously refused, and normative constraints degrade roughly 20 percentage points more than epistemic constraints. This normative inversion suggests that verbal knowledge of a boundary does not guarantee robust behavioral enforcement under pressure. Failure analysis reveals that these errors are often not silent. Some models comply immediately without acknowledgment; others explicitly identify the violated constraint before answering anyway. This second pattern, which we term rationalized compliance, is invisible to standard refusal-rate metrics and highlights a dissociation between represented knowledge and behavior under pressure. Together, these findings show that refusal robustness is not a fixed capability but a context-dependent behavior. Evaluating it requires varying instruction framing systematically, not only measuring performance at a single prompt setting.
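As an illustration of the kind of measurement the abstract describes, the sketch below computes a per-framing refusal rate and the share of cases refused under a low-pressure framing that flip to compliance under a coercive one. It is a minimal sketch and not the authors' FrameProbe code: the record layout, the framing labels `low_pressure` and `coercive`, the function names, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy records: (case_id, constraint_type, framing, refused).
# These are illustrative values only, not results from the paper.
records = [
    ("c1", "epistemic", "low_pressure", True),
    ("c1", "epistemic", "coercive", True),
    ("c2", "normative", "low_pressure", True),
    ("c2", "normative", "coercive", False),
    ("c3", "normative", "low_pressure", True),
    ("c3", "normative", "coercive", False),
    ("c4", "epistemic", "low_pressure", False),
    ("c4", "epistemic", "coercive", False),
]


def refusal_rate_by_framing(records):
    """Fraction of cases refused under each instruction framing."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for _case_id, _ctype, framing, was_refused in records:
        totals[framing] += 1
        refused[framing] += int(was_refused)
    return {framing: refused[framing] / totals[framing] for framing in totals}


def flip_rate(records, baseline="low_pressure", pressured="coercive"):
    """Among cases refused under `baseline`, the fraction complied with under `pressured`."""
    outcome = {(case_id, framing): r for case_id, _ctype, framing, r in records}
    base_refused = [cid for (cid, framing), r in outcome.items()
                    if framing == baseline and r]
    if not base_refused:
        return 0.0
    flipped = sum(1 for cid in base_refused
                  if outcome.get((cid, pressured)) is False)
    return flipped / len(base_refused)


rates = refusal_rate_by_framing(records)
flips = flip_rate(records)
```

Because the same case identifiers appear under every framing, any change in the outcome can be attributed to the framing rather than to the task content, which is the design idea the abstract emphasizes. A single aggregate refusal rate would hide the flips that the second function counts.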

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 14.3% (97 papers in training set; Top 0.4%)
2. Philosophical Transactions of the Royal Society B: 8.3% (51 papers in training set; Top 0.3%)
3. Nature Communications: 7.1% (4913 papers in training set; Top 25%)
4. Scientific Reports: 6.3% (3102 papers in training set; Top 18%)
5. PLOS Digital Health: 6.3% (91 papers in training set; Top 0.4%)
6. Cell Systems: 3.9% (167 papers in training set; Top 3%)
7. Nature Medicine: 3.6% (117 papers in training set; Top 0.9%)
8. Nature Human Behaviour: 3.6% (85 papers in training set; Top 0.9%)

50% of probability mass above

9. Nature Machine Intelligence: 3.0% (61 papers in training set; Top 1%)
10. Nature Biomedical Engineering: 2.6% (42 papers in training set; Top 0.5%)
11. eLife: 2.6% (5422 papers in training set; Top 32%)
12. PLOS Computational Biology: 2.6% (1633 papers in training set; Top 12%)
13. PLOS ONE: 2.1% (4510 papers in training set; Top 48%)
14. Journal of the American Medical Informatics Association: 1.8% (61 papers in training set; Top 1%)
15. Med: 1.8% (38 papers in training set; Top 0.2%)
16. Science Translational Medicine: 1.7% (111 papers in training set; Top 3%)
17. JCO Clinical Cancer Informatics: 1.7% (18 papers in training set; Top 0.5%)
18. Proceedings of the National Academy of Sciences: 1.5% (2130 papers in training set; Top 35%)
19. Nature: 1.5% (575 papers in training set; Top 12%)
20. Communications Psychology: 1.5% (20 papers in training set; Top 0.1%)
21. Annals of Internal Medicine: 1.3% (27 papers in training set; Top 0.5%)
22. Science: 1.2% (429 papers in training set; Top 17%)
23. Nature Methods: 1.2% (336 papers in training set; Top 5%)
24. iScience: 0.9% (1063 papers in training set; Top 24%)
25. GENETICS: 0.9% (189 papers in training set; Top 1%)
26. The Lancet Digital Health: 0.8% (25 papers in training set; Top 1%)
27. Communications Medicine: 0.7% (85 papers in training set; Top 1%)
28. Science Advances: 0.7% (1098 papers in training set; Top 32%)
29. Nature Computational Science: 0.7% (50 papers in training set; Top 2%)
30. PNAS Nexus: 0.7% (147 papers in training set; Top 2%)
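
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below assumes the cutoff is the smallest rank whose cumulative probability reaches the threshold; the function name `journals_to_reach` is hypothetical, and the input is the first ten percentages from the ranking.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank whose running sum reaches threshold."""
    cumulative = 0.0
    for rank, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= threshold:
            return rank, cumulative
    return None  # threshold never reached by the supplied entries


# Predicted probabilities (percent) for ranks 1 to 10, copied from the ranking.
top10 = [14.3, 8.3, 7.1, 6.3, 6.3, 3.9, 3.6, 3.6, 3.0, 2.6]

rank, mass = journals_to_reach(top10)
print(rank, round(mass, 1))  # prints: 8 53.4
```

The first seven entries sum to 49.8%, just short of the threshold, so the eighth journal is the one that crosses it.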