
Boards-style benchmarks overestimate prior-chat bias in large language models: a factorial evaluation study

2026-02-14 · health informatics · medRxiv preprint (title + abstract only)

Background: Large language models (LLMs) are increasingly piloted as chat interfaces for chart review and clinical decision support. Although leading models achieve, and in some cases exceed, physician-level accuracy on exam-style benchmarks such as MedQA, recent perturbation studies show large drops in accuracy after small changes to prompts, distractor content, or answer format. Prior work has not systematically examined how these vulnerabilities manifest unintentionally in clinically realistic settings, i...
