
Boards-style benchmarks overestimate prior-chat bias in large language models: a factorial evaluation study

Stanwyck, C.; Adibi, A.; Dozie-Nnamah, P.; Alsentzer, E.

Posted 2026-02-14 · health informatics
medRxiv · DOI: 10.64898/2026.02.12.26346164

Background: Large language models (LLMs) are increasingly piloted as chat interfaces for chart review and clinical decision support. Although leading models meet or exceed physician-level accuracy on exam-style benchmarks such as MedQA, recent perturbation studies show large drops in accuracy after small changes to prompts, distractor content, or answer format. Prior work has not systematically examined how these vulnerabilities manifest in clinically realistic settings, including multi-turn chatbot interactions, free-text response formats, and tasks involving patient medical records.

Methods: We evaluated susceptibility to bias from prior chat messages across 14 LLMs (10 closed-source, 4 open-source) on two medical question-answering tasks: a boards-style benchmark (1,000 MedQA test questions) and an electronic health record (EHR) information-retrieval task (962 EHRNoteQA questions about real patient discharge summaries). Using a factorial design, we independently varied the presence and type of prior-chat distractors and the response format across the two tasks. Distractors ranged from simple statements of incorrect answers to more realistic conversational exchanges between user and model, including interactions referencing a different patient.

Findings: Prior-chat distractors produced large and consistent accuracy decrements in the MedQA multiple-choice setting, particularly when the prior message stated an incorrect answer. In that setting, inserting such a user message led to significant accuracy decreases in 13 of 14 models, with drops averaging 15.0 percentage points across models. Effects were smaller for more plausible, conversational distractors and in free-response formats. In contrast, prior-chat bias in the discharge-summary task was modest and inconsistent: average accuracy decreases were under 2 percentage points across all distractor types and response formats assessed, with significant effects observed in only a minority of models.

Interpretation: LLM performance can be biased toward incorrect answers by plausible prior-chat distractors, but these effects are highly context-dependent. Distraction effects were common and often substantial in the boards-style multiple-choice task, particularly when the distractor was an explicit (and unrealistic) prior message containing an incorrect answer. The effects were markedly attenuated when the same questions were posed in free-response format with the distractor embedded in a clinically realistic user-model exchange in the chat history, or when the task switched from a boards-style vignette to a question about a real (de-identified) patient record. Taken together, these results suggest that evaluations based solely on single-turn, boards-style multiple-choice questions with unrealistic distractors may overstate the impact of prior-chat bias. These findings highlight the need to assess LLM behavior in multi-turn settings involving realistic clinical use cases, rather than relying on boards-style benchmarks alone for assessment of safety risks.
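The factorial design described in Methods — two tasks, several prior-chat distractor levels, and two response formats, with the distractor injected into the chat history ahead of the target question — can be sketched as follows. This is a minimal illustration, not the study's actual code; the factor names, distractor wording, and the `build_chat` helper are all hypothetical.

```python
from itertools import product
from typing import Optional

# Hypothetical factor levels mirroring the factorial design described above
# (labels are illustrative, not the study's actual configuration).
TASKS = ["medqa", "ehrnoteqa"]
DISTRACTORS = [
    None,                        # control: no prior-chat content
    "stated_incorrect_answer",   # unrealistic: user flatly asserts a wrong answer
    "conversational_exchange",   # realistic: prior user-model exchange
    "different_patient_exchange" # realistic: exchange about another patient
]
FORMATS = ["multiple_choice", "free_response"]

def build_chat(question: str, distractor: Optional[str]) -> list[dict]:
    """Assemble a multi-turn chat history with an optional prior-chat distractor
    inserted before the target question (placeholder text throughout)."""
    history: list[dict] = []
    if distractor == "stated_incorrect_answer":
        # A single prior user turn stating an incorrect answer.
        history.append({"role": "user", "content": "I think the answer is B."})
    elif distractor is not None:
        # A prior user-model exchange, possibly referencing a different patient.
        history.append({"role": "user", "content": "Earlier question about another case..."})
        history.append({"role": "assistant", "content": "Earlier model reply..."})
    history.append({"role": "user", "content": question})
    return history

# Enumerate every factorial cell: 2 tasks x 4 distractor levels x 2 formats = 16 conditions.
conditions = list(product(TASKS, DISTRACTORS, FORMATS))
```

Each condition would then be paired with every benchmark question and scored, so that distractor and format effects can be estimated independently per task and per model.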

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 19.5%
2 | npj Digital Medicine | 97 | Top 0.2% | 18.7%
3 | Frontiers in Digital Health | 20 | Top 0.1% | 6.4%
4 | Journal of Medical Internet Research | 85 | Top 0.8% | 6.3%
-- 50% of probability mass above this line --
5 | Journal of General Internal Medicine | 20 | Top 0.1% | 4.9%
6 | JAMIA Open | 37 | Top 0.3% | 4.3%
7 | BMJ Health & Care Informatics | 13 | Top 0.2% | 3.6%
8 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 2.7%
9 | PLOS Digital Health | 91 | Top 1% | 2.5%
10 | JAMA Network Open | 127 | Top 2% | 2.1%
11 | Scientific Reports | 3102 | Top 50% | 2.1%
12 | PLOS ONE | 4510 | Top 50% | 1.9%
13 | JMIR Medical Informatics | 17 | Top 0.8% | 1.5%
14 | Critical Care Explorations | 15 | Top 0.3% | 1.3%
15 | Philosophical Transactions of the Royal Society B | 51 | Top 4% | 1.3%
16 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.3%
17 | International Journal of Medical Informatics | 25 | Top 1% | 1.3%
18 | Journal of Biomedical Informatics | 45 | Top 1% | 1.3%
19 | Healthcare | 16 | Top 1% | 1.0%
20 | Annals of Internal Medicine | 27 | Top 0.7% | 0.9%
21 | Computers in Biology and Medicine | 120 | Top 4% | 0.8%
22 | JMIR Formative Research | 32 | Top 2% | 0.8%
23 | JAMA Pediatrics | 10 | Top 0.2% | 0.8%
24 | iScience | 1063 | Top 34% | 0.7%
25 | Heliyon | 146 | Top 8% | 0.6%
26 | Inflammatory Bowel Diseases | 15 | Top 0.3% | 0.6%
27 | BMC Medical Research Methodology | 43 | Top 2% | 0.5%
28 | JAMA | 17 | Top 0.6% | 0.5%