Top 0.2%, 18.5%
Top 0.4%, 15.3%
Top 0.7%, 11.5%
Top 0.4%, 9.8%
Top 1%, 6.7%
Top 34%, 6.7%
Top 2%, 6.2%
Top 5%, 4.4%
Top 2%, 2.9%
Top 2%, 2.9%
Top 1%, 2.0%
Top 94%, 1.6%
Top 42%, 1.6%
Top 6%, 1.0%
Top 1%, 1.0%
Top 16%, 0.7%
Top 5%, 0.5%
MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning
2026-01-29
health informatics
Title + abstract only
View on medRxiv
Large Language Models (LLMs) demonstrate strong performance at medical specialty board multiple-choice question (MCQ) answering but underperform in more complex medical reasoning scenarios. This gap indicates a need to improve both LLM medical reasoning and evaluation paradigms. We introduce MedEvalArena, a framework in which LLMs engage in a symmetric round-robin format. Each model generates challenging board-style medical MCQs, then serves in an ensemble LLM-as-judge bench to adjudica...
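The round-robin idea in the abstract (each model writes questions, then sits on a judging panel) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the paper's implementation: the `round_robin` helper, the use of plain callables as stand-ins for models, the strict-majority vote, and the `exclude_self` option are all hypothetical, since the truncated abstract does not specify how judging is aggregated or whether a model judges its own questions.

```python
from typing import Callable, Dict

# Hypothetical stand-ins: a "model" is represented by two plain callables.
Generator = Callable[[], str]   # returns one MCQ as text
Judge = Callable[[str], bool]   # True if the judge accepts the MCQ


def round_robin(generators: Dict[str, Generator],
                judges: Dict[str, Judge],
                exclude_self: bool = True) -> Dict[str, dict]:
    """Each model authors one question; a panel of models votes on it.

    Assumption: acceptance is a strict majority of the panel. The source
    abstract does not state the actual aggregation rule.
    """
    results = {}
    for author, generate in generators.items():
        question = generate()
        # Assumption: the author is left off its own panel by default.
        panel = [name for name in judges
                 if not (exclude_self and name == author)]
        votes = {name: judges[name](question) for name in panel}
        accepted = sum(votes.values()) * 2 > len(votes)
        results[author] = {
            "question": question,
            "votes": votes,
            "accepted": accepted,
        }
    return results


# Toy usage with three fake "models" (no real LLM calls).
generators = {
    "A": lambda: "Q from A",
    "B": lambda: "Q from B",
    "C": lambda: "Q from C",
}
judges = {
    "A": lambda q: True,
    "B": lambda q: "A" in q,   # this judge only accepts model A's question
    "C": lambda q: True,
}
out = round_robin(generators, judges)
```

In this toy run, model C's question is rejected because its two-judge panel splits one to one, which is not a strict majority. A real setup would replace the lambdas with calls to the models under evaluation.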
Predicted journal destinations
1. Journal of the American Medical Informatics Association (53 training papers)
2. npj Digital Medicine (85 training papers)
3. PLOS Digital Health (88 training papers)
4. Journal of Biomedical Informatics (37 training papers)
5. JAMIA Open (35 training papers)
6. Scientific Reports (701 training papers)
7. BMC Medical Informatics and Decision Making (36 training papers)
8. Journal of Medical Internet Research (81 training papers)
9. Computers in Biology and Medicine (39 training papers)
10. International Journal of Medical Informatics (25 training papers)
11. JMIR Medical Informatics (16 training papers)
12. PLOS ONE (1737 training papers)
13. Nature Communications (483 training papers)
14. BMC Medical Research Methodology (41 training papers)
15. Patterns (15 training papers)
16. Nature Medicine (88 training papers)
17. Bioinformatics (24 training papers)