
MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning

2026-01-29 · health informatics

Large Language Models (LLMs) demonstrate strong performance on medical specialty board multiple-choice question (MCQ) answering; however, they underperform in more complex medical reasoning scenarios. This gap indicates a need to improve both LLM medical reasoning and its evaluation paradigms. We introduce MedEvalArena, a framework in which LLMs engage in a symmetric round-robin format. Each model generates challenging board-style medical MCQs, then serves on an ensemble LLM-as-judge bench to adjudica...
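The abstract describes the protocol only at a high level. As a reading aid, here is a minimal Python sketch of a symmetric round-robin with ensemble peer judging: the `Model` interface, the majority-vote aggregation, and the accuracy scoring are all illustrative assumptions, not the paper's actual implementation.

```python
"""A minimal sketch of the round-robin protocol the abstract describes,
assuming a hypothetical Model interface; MedEvalArena's real prompts,
scoring, and judge-aggregation rules are not given in the abstract."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    generate_mcq: Callable[[], dict]   # writes a board-style MCQ
    answer: Callable[[dict], str]      # answers a peer's MCQ
    judge: Callable[[dict, str], bool] # votes on a response's correctness


def round_robin(models: list[Model]) -> dict[str, float]:
    """Each model authors a question; every other model answers it; the
    remaining models form an ensemble judge (simple majority vote here,
    an assumption -- the paper's aggregation rule may differ)."""
    wins = {m.name: 0 for m in models}
    trials = {m.name: 0 for m in models}
    for author in models:
        mcq = author.generate_mcq()
        for solver in models:
            if solver is author:
                continue  # models never answer their own questions
            response = solver.answer(mcq)
            judges = [m for m in models if m is not author and m is not solver]
            votes = sum(j.judge(mcq, response) for j in judges)
            trials[solver.name] += 1
            if votes * 2 > len(judges):  # strict majority says "correct"
                wins[solver.name] += 1
    return {name: wins[name] / trials[name] for name in wins if trials[name]}
```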
