
Boards-style benchmarks overestimate prior-chat bias in large language models: a factorial evaluation study

Stanwyck, C.; Adibi, A.; Dozie-Nnamah, P.; Alsentzer, E.

Posted 2026-02-14 · health informatics
medRxiv · DOI: 10.64898/2026.02.12.26346164

Background: Large language models (LLMs) are increasingly piloted as chat interfaces for chart review and clinical decision support. Although leading models meet or exceed physician-level accuracy on exam-style benchmarks such as MedQA, recent perturbation studies show large drops in accuracy after small changes to prompts, distractor content, or answer format. Prior work has not systematically examined how these vulnerabilities manifest in clinically realistic settings, including multi-turn chatbot interactions, free-text response formats, and tasks involving patient medical records.

Methods: We evaluated susceptibility to bias from prior chat messages across 14 LLMs (10 closed-source, 4 open-source) on two medical question-answering tasks: a boards-style benchmark (1,000 MedQA test questions) and an electronic health record (EHR) information-retrieval task (962 EHRNoteQA questions about real patient discharge summaries). Using a factorial design, we independently varied the presence and type of prior-chat distractors and the response format across the two tasks. Distractors ranged from simple statements of incorrect answers to more realistic conversational exchanges between user and model, including interactions referencing a different patient.

Findings: Prior-chat distractors produced large and consistent accuracy decrements in the MedQA multiple-choice setting, particularly when the prior message stated an incorrect answer. In that setting, inserting such a user message led to significant accuracy decreases in 13 of 14 models, with drops averaging 15.0 percentage points across models. Effects were smaller for more plausible, conversational distractors and in free-response formats. In contrast, prior-chat bias in the discharge-summary task was modest and inconsistent: average accuracy decreases were under 2 percentage points across all distractor types and response formats assessed, with significant effects observed in only a minority of models.

Interpretation: LLM performance can be biased toward incorrect answers by plausible prior-chat distractors, but these effects are highly context-dependent. Distraction effects were common and often substantial in the boards-style multiple-choice task, particularly when the distractor was an explicit (and unrealistic) prior message containing an incorrect answer. The effects were markedly attenuated when the same questions were posed in free-response format with the distractor embedded in a clinically realistic user-model exchange in the chat history, or when the task switched from a boards-style vignette to a question about a real (de-identified) patient record. Taken together, these results suggest that evaluations based solely on single-turn, boards-style multiple-choice questions with unrealistic distractors may overstate the impact of prior-chat bias. These findings highlight the need to assess LLM behavior in multi-turn settings involving realistic clinical use cases, rather than relying on boards-style benchmarks alone for assessment of safety risks.
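The factorial design described in Methods — two tasks, several prior-chat distractor levels, and two response formats, with the distractor injected into the chat history ahead of the target question — can be sketched as follows. This is a minimal illustration, not the study's actual code; the factor names, distractor wording, and the `build_chat` helper are all hypothetical.

```python
from itertools import product
from typing import Optional

# Hypothetical factor levels mirroring the factorial design described above
# (labels are illustrative, not the study's actual configuration).
TASKS = ["medqa", "ehrnoteqa"]
DISTRACTORS = [
    None,                        # control: no prior-chat content
    "stated_incorrect_answer",   # unrealistic: user flatly asserts a wrong answer
    "conversational_exchange",   # realistic: prior user-model exchange
    "different_patient_exchange" # realistic: exchange about another patient
]
FORMATS = ["multiple_choice", "free_response"]

def build_chat(question: str, distractor: Optional[str]) -> list[dict]:
    """Assemble a multi-turn chat history with an optional prior-chat distractor
    inserted before the target question (placeholder text throughout)."""
    history: list[dict] = []
    if distractor == "stated_incorrect_answer":
        # A single prior user turn stating an incorrect answer.
        history.append({"role": "user", "content": "I think the answer is B."})
    elif distractor is not None:
        # A prior user-model exchange, possibly referencing a different patient.
        history.append({"role": "user", "content": "Earlier question about another case..."})
        history.append({"role": "assistant", "content": "Earlier model reply..."})
    history.append({"role": "user", "content": question})
    return history

# Enumerate every factorial cell: 2 tasks x 4 distractor levels x 2 formats = 16 conditions.
conditions = list(product(TASKS, DISTRACTORS, FORMATS))
```

Each condition would then be paired with every benchmark question and scored, so that distractor and format effects can be estimated independently per task and per model.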

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 19.5%
2 | npj Digital Medicine | 97 | Top 0.2% | 18.7%
3 | Frontiers in Digital Health | 20 | Top 0.1% | 6.4%
4 | Journal of Medical Internet Research | 85 | Top 0.8% | 6.3%
-- 50% of probability mass above this line --
5 | Journal of General Internal Medicine | 20 | Top 0.1% | 4.9%
6 | JAMIA Open | 37 | Top 0.3% | 4.3%
7 | BMJ Health & Care Informatics | 13 | Top 0.2% | 3.6%
8 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 2.7%
9 | PLOS Digital Health | 91 | Top 1% | 2.5%
10 | JAMA Network Open | 127 | Top 2% | 2.1%
11 | Scientific Reports | 3102 | Top 50% | 2.1%
12 | PLOS ONE | 4510 | Top 50% | 1.9%
13 | JMIR Medical Informatics | 17 | Top 0.8% | 1.5%
14 | Critical Care Explorations | 15 | Top 0.3% | 1.3%
15 | Philosophical Transactions of the Royal Society B | 51 | Top 4% | 1.3%
16 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.3%
17 | International Journal of Medical Informatics | 25 | Top 1% | 1.3%
18 | Journal of Biomedical Informatics | 45 | Top 1% | 1.3%
19 | Healthcare | 16 | Top 1% | 1.0%
20 | Annals of Internal Medicine | 27 | Top 0.7% | 0.9%
21 | Computers in Biology and Medicine | 120 | Top 4% | 0.8%
22 | JMIR Formative Research | 32 | Top 2% | 0.8%
23 | JAMA Pediatrics | 10 | Top 0.2% | 0.8%
24 | iScience | 1063 | Top 34% | 0.7%
25 | Heliyon | 146 | Top 8% | 0.6%
26 | Inflammatory Bowel Diseases | 15 | Top 0.3% | 0.6%
27 | BMC Medical Research Methodology | 43 | Top 2% | 0.5%
28 | JAMA | 17 | Top 0.6% | 0.5%