MedicalBench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics
10.64898/2026.04.12.26350704 medRxiv
Abstract

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extraction, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
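The two evaluation tasks named in the abstract could be scored along the following lines. This is only a minimal sketch: the abstract does not state which metrics the authors use, so F1 for both tasks is an assumption, and the `Example` and `Prediction` types and the function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    # Gold annotation for one (note, concept) pair. Hypothetical schema.
    label: bool                # does the concept apply to the note?
    evidence: frozenset[int]   # indices of gold evidence sentences


@dataclass(frozen=True)
class Prediction:
    # Model output for the same pair. Hypothetical schema.
    label: bool
    evidence: frozenset[int]


def verification_f1(golds: list[Example], preds: list[Prediction]) -> float:
    """Task 1 (assumed metric): F1 of the positive class over all pairs."""
    tp = sum(g.label and p.label for g, p in zip(golds, preds))
    fp = sum((not g.label) and p.label for g, p in zip(golds, preds))
    fn = sum(g.label and (not p.label) for g, p in zip(golds, preds))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evidence_f1(gold: frozenset[int], pred: frozenset[int]) -> float:
    """Task 2 (assumed metric): sentence-level F1 for a single example."""
    if not gold and not pred:
        return 1.0  # nothing to retrieve and nothing retrieved
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


golds = [Example(True, frozenset({1, 2})), Example(False, frozenset())]
preds = [Prediction(True, frozenset({2, 3})), Prediction(True, frozenset())]
task1_score = verification_f1(golds, preds)
task2_scores = [evidence_f1(g.evidence, p.evidence) for g, p in zip(golds, preds)]
```

Scoring the two tasks separately mirrors the abstract's distinction between correctness (the verification label) and interpretability (the retrieved evidence sentences).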

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 97 papers in training set; Top 0.2%; 22.9%
2. Journal of Biomedical Informatics: 45 papers in training set; Top 0.1%; 22.9%
3. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.3%; 10.3%

50% of probability mass above

4. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.2%; 4.0%
5. Scientific Reports: 3102 papers in training set; Top 30%; 4.0%
6. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 1%; 2.1%
7. The Lancet Digital Health: 25 papers in training set; Top 0.3%; 2.1%
8. International Journal of Medical Informatics: 25 papers in training set; Top 0.6%; 2.1%
9. Bioinformatics: 1061 papers in training set; Top 7%; 1.9%
10. JAMIA Open: 37 papers in training set; Top 0.7%; 1.9%
11. PLOS Digital Health: 91 papers in training set; Top 1%; 1.8%
12. Nature Medicine: 117 papers in training set; Top 2%; 1.7%
13. JMIR Medical Informatics: 17 papers in training set; Top 0.7%; 1.7%
14. Med: 38 papers in training set; Top 0.3%; 1.7%
15. Journal of Medical Internet Research: 85 papers in training set; Top 3%; 1.2%
16. Artificial Intelligence in Medicine: 15 papers in training set; Top 0.4%; 1.2%
17. Patterns: 70 papers in training set; Top 2%; 0.8%
18. iScience: 1063 papers in training set; Top 28%; 0.8%
19. Nature Communications: 4913 papers in training set; Top 61%; 0.8%
20. PLOS ONE: 4510 papers in training set; Top 67%; 0.8%
21. Frontiers in Digital Health: 20 papers in training set; Top 1%; 0.7%
22. Nature Biomedical Engineering: 42 papers in training set; Top 3%; 0.5%
23. BMJ Health & Care Informatics: 13 papers in training set; Top 1%; 0.5%
24. Science Translational Medicine: 111 papers in training set; Top 8%; 0.5%
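
The "top 3 journals" cutoff can be checked against the listed probabilities. The sketch below assumes the cutoff is the smallest number of top-ranked journals whose cumulative probability reaches at least 50%; that reading, and the function name `journals_to_reach`, are assumptions rather than anything the page states.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (count, cumulative) for the fewest top-ranked entries whose
    summed probability (in percent) reaches the threshold, or None."""
    total = 0.0
    ranked = sorted(probabilities, reverse=True)
    for count, p in enumerate(ranked, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None  # threshold never reached


# Probabilities (percent) of the five highest-ranked journals in the list.
top_five = [22.9, 22.9, 10.3, 4.0, 4.0]
result = journals_to_reach(top_five)
```

Under this reading the first two journals sum to 45.8% and the third brings the total to 56.1%, so three journals are needed to cross the 50% mark.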