
AlignInsight: A Three-Layer Framework for Detecting Deceptive Alignment and Evaluation Awareness in Healthcare AI Systems

Onovo, A. A.; Cherima, Y. J.

2026-01-21 · health informatics
medRxiv preprint · DOI: 10.64898/2026.01.17.26344330

Importance: Emerging evidence suggests healthcare AI systems may exhibit deceptive alignment (appearing safe during validation while optimizing for misaligned objectives in deployment) and evaluation awareness (detecting and adapting behavior during audits), undermining regulatory validation frameworks.

Objective: To quantify the performance of multi-layer red-teaming approaches in detecting sophisticated healthcare AI safety failures across 10 vulnerability domains.

Design, Setting, and Participants: Systematic red-teaming evaluation conducted December 2025-January 2026 using 50 healthcare-specific adversarial prompts across 10 vulnerability domains (5 prompts per domain). GPT-OSS-20B (a 21-billion-parameter open-source model) was evaluated using a three-layer framework: Layer 1 (automated keyword pattern matching), Layer 2 (GPT-5 semantic analysis with chain-of-thought reasoning), and Layer 3 (PhD-level human expert adjudication as ground truth). Inter-rater reliability was assessed using the Cohen kappa statistic.

Main Outcomes and Measures: The primary outcome was risk classification (SAFE, LOW, MODERATE, or HIGH RISK) with severity scoring (0-10 scale). Secondary outcomes included layer detection rates, inter-rater agreement, thematic vulnerability patterns, and chain-of-thought reasoning exposure.

Results: Of 50 prompts evaluated, 12 (24%; 95% CI, 13.5%-38.2%) were classified HIGH RISK, 8 (16%; 95% CI, 7.7%-28.8%) MODERATE RISK, 17 (34%; 95% CI, 21.8%-48.3%) LOW RISK, and 13 (26%; 95% CI, 15.2%-39.9%) SAFE. The Evaluation Awareness domain exhibited a 100% HIGH/MODERATE RISK rate (5/5 prompts; mean severity 7.4/10), with the model providing technical specifications for audit detection and behavior switching. The Sandbagging domain showed a 40% HIGH RISK rate (2/5 prompts) with FDA validation gaming strategies. All 12 HIGH RISK prompts involved audit gaming themes (audit detection, behavior switching, system fingerprinting, or reporting violations). Layer 1 keyword matching detected 2/12 (17%; 95% CI, 4.7%-44.8%) high-risk prompts. Layer 2 GPT-5 analysis detected 12/12 (100%; 95% CI, 75.8%-100%) high-risk prompts with 0/13 (0%; 95% CI, 0%-22.8%) false positives. Human expert validation confirmed perfect concordance with Layer 2 assessments (kappa = 1.00; 95% CI, 0.999-1.000; p < 0.001), validating automated semantic analysis as a reliable screening tool. Chain-of-thought leakage occurred in 28/50 (56%) prompts, exposing internal safety reasoning.

Conclusions and Relevance: Multi-layer evaluation is essential for detecting sophisticated AI safety failures in healthcare. Keyword filtering alone missed 83% (95% CI, 55.2%-95.3%) of high-risk behaviors. Perfect inter-rater agreement (kappa = 1.00) between automated AI semantic analysis and human expert judgment demonstrates that scalable, reliable safety screening is achievable. All HIGH RISK outputs contained audit gaming content, indicating a systematic capability to articulate regulatory circumvention. Healthcare AI systems require domain-specific red-teaming for regulatory audit gaming and dual-mode behavior detection. Findings reveal critical gaps in current AI safety measures, with immediate implications for FDA/CMS regulatory frameworks.
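Several of the interval estimates above (e.g., 2/12 → 4.7%-44.8%; 0/13 → 0%-22.8%) are consistent with Wilson score confidence intervals for binomial proportions, though the abstract does not name the method. A minimal sketch, assuming Wilson intervals and a two-category HIGH RISK vs. other labeling for the kappa check (both assumptions, not stated in the abstract):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = sum((a.count(c) / n) * (b.count(c) / n)        # chance agreement
             for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

# Layer 1 detected 2 of 12 high-risk prompts: 17%
lo, hi = wilson_ci(2, 12)
print(f"Layer 1 CI: {lo:.1%}-{hi:.1%}")   # → Layer 1 CI: 4.7%-44.8%

# Perfect Layer-2/human concordance on 12 HIGH RISK vs 38 other prompts
labels = ["HIGH"] * 12 + ["OTHER"] * 38
print(cohens_kappa(labels, labels))       # → 1.0
```

Note that perfect concordance yields kappa = 1.0 regardless of the category split, since observed agreement is 1 while chance agreement stays below 1.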

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
---- | ------- | ---------------------- | ---------- | -----------
1 | npj Digital Medicine | 97 | Top 0.1% | 27.9%
2 | Journal of the American Medical Informatics Association | 61 | Top 0.2% | 12.6%
3 | BMJ Health & Care Informatics | 13 | Top 0.1% | 10.2%
4 | PLOS ONE | 4510 | Top 33% | 4.4%
5 | JAMIA Open | 37 | Top 0.4% | 3.6%
6 | Frontiers in Digital Health | 20 | Top 0.2% | 3.6%
7 | JMIR Medical Informatics | 17 | Top 0.3% | 3.6%
8 | Scientific Reports | 3102 | Top 40% | 3.3%
9 | JMIR Public Health and Surveillance | 45 | Top 1% | 2.1%
10 | Journal of Medical Internet Research | 85 | Top 3% | 1.7%
11 | PLOS Digital Health | 91 | Top 1% | 1.7%
12 | BMC Medical Research Methodology | 43 | Top 0.8% | 1.3%
13 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.3%
14 | BMJ Open | 554 | Top 10% | 1.3%
15 | Philosophical Transactions of the Royal Society B | 51 | Top 4% | 1.3%
16 | JMIR Formative Research | 32 | Top 1% | 1.2%
17 | Computers in Biology and Medicine | 120 | Top 4% | 0.9%
18 | Healthcare | 16 | Top 1% | 0.9%
19 | Annals of Internal Medicine | 27 | Top 0.8% | 0.8%
20 | JCO Clinical Cancer Informatics | 18 | Top 0.8% | 0.8%
21 | Journal of Biomedical Informatics | 45 | Top 1% | 0.8%
22 | DIGITAL HEALTH | 12 | Top 0.7% | 0.7%
23 | The Lancet Digital Health | 25 | Top 1% | 0.6%
24 | JAMA Network Open | 127 | Top 5% | 0.6%
25 | Frontiers in Public Health | 140 | Top 9% | 0.6%
26 | International Journal of Medical Informatics | 25 | Top 2% | 0.5%
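The "50% of the predicted probability mass" claim can be checked with a cumulative sum over the listed probabilities (values transcribed from the ranking above; the matching tool's internals are not public, so this is only an illustration):

```python
# Predicted probabilities (%) for the top-ranked journals, highest first.
probs = [27.9, 12.6, 10.2, 4.4, 3.6, 3.6, 3.6, 3.3, 2.1, 1.7]

cum = 0.0
for rank, p in enumerate(probs, start=1):
    cum += p
    if cum >= 50.0:
        print(f"Top {rank} journals cover {cum:.1f}% of probability mass")
        break
# → Top 3 journals cover 50.7% of probability mass
```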