Clinical Agents Don't Care
Klang, E.; Glicksberg, B. S.; Gorenshtein, A.; Gavin, N.; Freeman, R.; Stump, L.; Charney, A. W.; Ting, D. S. W.; Omar, M.; Nadkarni, G.
Background: Large language models (LLMs) now power clinical agents that can plan, call tools, and write into electronic health records (EHRs). They are becoming actors, not assistants. Given known LLM faults, quality assurance is essential before clinical use. A key question is whether agents notice patient-identity errors or remain indifferent to them.

Methods: We created a record environment using publicly available MIMIC-IV real-world emergency department data. Agents were instructed to copy ICD-10 codes from visit headers into patient records using Extract and Store tools, with the option to record "UNKNOWN" if uncertain or to abstain. Each agent was first presented with ten batched records from the same patient (the clean version). We then tampered with one of the records and evaluated how the agent responded. We ran four separate batches: the clean baseline batch, a batch in which one visit carried a header swapped in its entirety from another patient, a batch in which one visit had a one-digit MRN change, and a batch in which the age was shifted in one visit. Six models, both closed- and open-weight, completed 1.2 million tool calls. The endpoint was whether agents identified fields that were inconsistent with the patient's identity.

Results: Agents frequently failed, copying codes into tampered charts. GPT-4.1 flagged mismatched headers as UNKNOWN in 17.4% of runs but never detected subtle faults. GPT-4.1-nano detected 4.4% of header swaps and <1% of MRN or age changes. GPT-5-chat never identified mismatches but omitted responses in 12.6% of cases. Other models rarely abstained. Subtle tampering passed almost entirely without detection.

Conclusions: Clinical agents are often indifferent to inconsistencies in patient details. The central risk is misbinding, not miscoding. Safe deployment requires explicit identity verification, abstention when uncertain, and benchmarks that treat record integrity, not just accuracy, as a primary outcome.
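The explicit identity verification called for in the conclusions can be illustrated with a minimal Python sketch. It is not the authors' environment or tooling: the `VisitHeader` fields, the `codes_to_store` function, the `max_age_gap` tolerance, and the majority-vote rule are all assumptions made for illustration. The idea is to compare each visit header against the batch's majority MRN and age, and to store "UNKNOWN" instead of the ICD-10 code when a header disagrees.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VisitHeader:
    # Hypothetical header layout; the abstract names only MRN, age, and ICD-10 code.
    mrn: str
    age: int
    icd10: str


def majority(values):
    """Return the most common value in a non-empty list."""
    return Counter(values).most_common(1)[0][0]


def codes_to_store(batch, max_age_gap=1):
    """Return one code per visit, substituting "UNKNOWN" when the header
    disagrees with the batch's majority MRN or majority age.

    max_age_gap is an assumed tolerance (in years) for visits that
    legitimately straddle a birthday.
    """
    ref_mrn = majority([v.mrn for v in batch])
    ref_age = majority([v.age for v in batch])
    stored = []
    for visit in batch:
        consistent = (
            visit.mrn == ref_mrn and abs(visit.age - ref_age) <= max_age_gap
        )
        stored.append(visit.icd10 if consistent else "UNKNOWN")
    return stored


# Illustrative values only; these are not MIMIC-IV records.
clean = [VisitHeader("10012345", 54, "I10") for _ in range(10)]
mrn_tampered = clean[:9] + [VisitHeader("10012346", 54, "I10")]
print(codes_to_store(mrn_tampered)[-1])
```

A deterministic guard of this kind would sit in front of the Store tool, so that a one-digit MRN change or an age shift blocks the write regardless of whether the model itself notices the mismatch.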