
Clinical Agents Don't Care

Klang, E.; Glicksberg, B. S.; Gorenshtein, A.; Gavin, N.; Freeman, R.; Stump, L.; Charney, A. W.; Ting, D. S. W.; Omar, M.; Nadkarni, G.

2025-10-19 health informatics
10.1101/2025.10.17.25338226 medRxiv

Background: Large language models (LLMs) now power clinical agents that can plan, call tools, and write into electronic health records (EHRs). They are becoming actors, not assistants. Given known LLM faults, quality assurance is essential before clinical use. A key question is whether agents notice patient-identity errors or remain indifferent.

Methods: We created a record environment using publicly available MIMIC-IV real-world emergency department data. Agents were instructed to copy ICD-10 codes from visit headers into patient records using Extract and Store tools, with the option to record "UNKNOWN" or abstain if uncertain. Each agent was presented with ten batched records from the same patient (clean version). We then tampered with one of the records and evaluated how the agent responded. We ran four separate batches: a clean baseline batch, a batch in which one visit's header was fully swapped with another patient's, a batch with a one-digit MRN change in one visit, and a batch with the age shifted in one visit. Six models, both closed- and open-weight, completed 1.2 million tool calls to assess performance. The endpoint was whether agents would identify when identity fields were inconsistent.

Results: Agents frequently failed, copying codes into tampered charts. GPT-4.1 flagged mismatched headers as UNKNOWN in 17.4% of runs but never detected subtle faults. GPT-4.1-nano detected 4.4% of header swaps and <1% of MRN or age changes. GPT-5-chat never identified mismatches but omitted responses in 12.6% of cases. Other models rarely abstained. Subtle tampering passed almost entirely without detection.

Conclusions: Clinical agents are often indifferent to inconsistencies in patient details. The central risk is misbinding, not miscoding. Safe deployment requires explicit identity verification, abstention when uncertain, and benchmarks that treat record integrity, not just accuracy, as a primary outcome.
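To make the tampering conditions concrete, here is a minimal sketch of the three perturbations the Methods describe (a full header swap, a one-digit MRN change, and an age shift), applied to a dict-based visit record. The field names (`header`, `mrn`, `age`), the ten-year shift, and the functions themselves are illustrative assumptions, not the authors' code or the MIMIC-IV schema.

```python
# Illustrative sketch only: assumed record structure, not the paper's implementation.
import copy
import random

def swap_header(visit: dict, other_patient_header: dict) -> dict:
    """Replace the entire identity header with one taken from another patient."""
    tampered = copy.deepcopy(visit)
    tampered["header"] = copy.deepcopy(other_patient_header)
    return tampered

def perturb_mrn(visit: dict) -> dict:
    """Change a single digit of the MRN while keeping its length and format."""
    tampered = copy.deepcopy(visit)
    mrn = list(tampered["header"]["mrn"])
    idx = random.randrange(len(mrn))
    mrn[idx] = random.choice([d for d in "0123456789" if d != mrn[idx]])
    tampered["header"]["mrn"] = "".join(mrn)
    return tampered

def shift_age(visit: dict, years: int = 10) -> dict:
    """Shift the recorded age in one visit, leaving all other fields untouched (shift size is an assumption)."""
    tampered = copy.deepcopy(visit)
    tampered["header"]["age"] += years
    return tampered
```

In each tampered batch, only one of the ten records would receive one of these perturbations; the agent's task of copying ICD-10 codes otherwise stays the same, and the measured outcome is whether it writes the codes anyway, records "UNKNOWN", or abstains.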

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.1% | 49.5%
2. npj Digital Medicine | 97 papers in training set | Top 0.3% | 14.5%
3. JAMIA Open | 37 papers in training set | Top 0.3% | 4.1%
4. Journal of Medical Internet Research | 85 papers in training set | Top 1% | 3.5%
5. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.3% | 2.6%
6. Journal of Biomedical Informatics | 45 papers in training set | Top 0.7% | 1.9%
7. BMJ Health & Care Informatics | 13 papers in training set | Top 0.3% | 1.9%
8. Frontiers in Digital Health | 20 papers in training set | Top 0.7% | 1.7%
9. The Lancet Digital Health | 25 papers in training set | Top 0.4% | 1.7%
10. PLOS Digital Health | 91 papers in training set | Top 2% | 1.5%
11. PLOS ONE | 4510 papers in training set | Top 57% | 1.5%
12. International Journal of Medical Informatics | 25 papers in training set | Top 1% | 1.1%
13. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 0.9%
14. JMIR Medical Informatics | 17 papers in training set | Top 1% | 0.9%
15. Journal of General Internal Medicine | 20 papers in training set | Top 0.9% | 0.8%
16. JAMA Network Open | 127 papers in training set | Top 4% | 0.7%
17. Scientific Reports | 3102 papers in training set | Top 75% | 0.7%
18. Nature Communications | 4913 papers in training set | Top 66% | 0.6%
19. Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 7% | 0.6%
20. JAMA Pediatrics | 10 papers in training set | Top 0.2% | 0.6%