MedScope: A Lightweight Benchmark of Open-Source Large Language Models for Medical Question Answering

Bian, R.; Cheng, W.

2026-04-01 · health informatics
medRxiv · DOI: 10.64898/2026.03.31.26349827
The rapid development of large language models (LLMs) has stimulated growing interest in their use for medical question answering and clinical decision support. However, compared with frontier proprietary systems, empirical understanding of lightweight open-source LLMs in medical settings remains limited, particularly under resource-constrained experimental conditions. To address this gap, we introduce MedScope, a lightweight benchmarking framework for systematically evaluating open-source LLMs on medical multiple-choice question answering. Using 1,000 sampled questions from MedMCQA, we benchmark six lightweight open-source models spanning three representative model families: LLaMA, Qwen, and Gemma. Beyond standard predictive metrics such as accuracy and macro-F1, our framework additionally considers inference time, prediction consistency, subject-wise variability, and model-specific error patterns. We further develop a set of multi-perspective visual analyses, including clustered heatmaps, agreement matrices, Pareto-style trade-off plots, radar charts, and multi-panel summary figures, to characterize model behavior in a more interpretable and comprehensive manner. Our results reveal substantial heterogeneity across models in predictive performance, efficiency, and subject-level robustness. While larger lightweight models generally achieve better overall results, the gain is neither uniform across subject categories nor always aligned with efficiency. These findings suggest that lightweight open-source LLMs remain valuable as transparent and reproducible medical AI baselines, but their current capabilities are still insufficient for unsupervised deployment in high-risk healthcare scenarios. MedScope provides an accessible benchmark for evaluating lightweight medical LLMs and emphasizes the need for multi-dimensional assessment beyond accuracy alone. The code is open-sourced at: https://github.com/VhoCheng/MedScope.
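The two standard predictive metrics mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the MedScope repository; it assumes gold answers and model predictions are stored as option letters (A–D), and the function names are hypothetical.

```python
def accuracy(golds, preds):
    """Fraction of questions where the predicted option matches the gold option."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def macro_f1(golds, preds, labels=("A", "B", "C", "D")):
    """Unweighted mean of per-option F1 scores, so rare options count equally."""
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(golds, preds))
        fp = sum(g != lab and p == lab for g, p in zip(golds, preds))
        fn = sum(g == lab and p != lab for g, p in zip(golds, preds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy example (not MedMCQA data): two errors out of six questions.
golds = ["A", "B", "C", "D", "A", "B"]
preds = ["A", "B", "C", "A", "A", "C"]
print(round(accuracy(golds, preds), 3))  # 0.667
print(round(macro_f1(golds, preds), 3))  # 0.533
```

Because macro-F1 averages over options rather than questions, a model that never predicts one option (here, D) is penalized even when its overall accuracy looks reasonable, which is why the paper reports both.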

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|---|---|---|---|---|
| 1 | npj Digital Medicine | 97 | Top 0.1% | 25.4% |
| 2 | Journal of Biomedical Informatics | 45 | Top 0.2% | 6.8% |
| 3 | Scientific Reports | 3102 | Top 19% | 6.3% |
| 4 | Journal of the American Medical Informatics Association | 61 | Top 0.5% | 4.8% |
| 5 | International Journal of Medical Informatics | 25 | Top 0.3% | 4.3% |
| 6 | PLOS Digital Health | 91 | Top 0.6% | 3.9% |
| 7 | Bioinformatics | 1061 | Top 5% | 3.6% |
| 8 | Artificial Intelligence in Medicine | 15 | Top 0.1% | 3.6% |
| 9 | BMC Medical Informatics and Decision Making | 39 | Top 0.8% | 3.6% |
| 10 | Computers in Biology and Medicine | 120 | Top 1% | 3.0% |
| 11 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 0.7% | 2.6% |
| 12 | JCO Clinical Cancer Informatics | 18 | Top 0.4% | 2.1% |
| 13 | Journal of Medical Internet Research | 85 | Top 3% | 1.7% |
| 14 | Nature Medicine | 117 | Top 2% | 1.7% |
| 15 | Frontiers in Digital Health | 20 | Top 0.8% | 1.5% |
| 16 | Biology Methods and Protocols | 53 | Top 1% | 1.3% |
| 17 | iScience | 1063 | Top 20% | 1.3% |
| 18 | JMIR Medical Informatics | 17 | Top 1.0% | 1.3% |
| 19 | Nature Machine Intelligence | 61 | Top 3% | 1.2% |
| 20 | Patterns | 70 | Top 1% | 1.2% |
| 21 | GigaScience | 172 | Top 2% | 1.1% |
| 22 | Nature Communications | 4913 | Top 58% | 0.9% |
| 23 | The Lancet Digital Health | 25 | Top 0.8% | 0.9% |
| 24 | Journal of Personalized Medicine | 28 | Top 0.9% | 0.9% |
| 25 | PLOS ONE | 4510 | Top 66% | 0.8% |
| 26 | Computer Methods and Programs in Biomedicine | 27 | Top 0.9% | 0.8% |
| 27 | BMC Bioinformatics | 383 | Top 7% | 0.7% |
| 28 | JAMIA Open | 37 | Top 2% | 0.7% |
| 29 | Cureus | 67 | Top 6% | 0.6% |
| 30 | Communications Medicine | 85 | Top 2% | 0.6% |