
MedicalBench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics
10.64898/2026.04.12.26350704 medRxiv
Abstract

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note–concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts.
We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extraction, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
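To make the two evaluation tasks concrete, here is a minimal sketch of how a benchmark example and its scoring might look. The record schema, field names, and example sentences below are illustrative assumptions, not the released MedicalBench format: Task 1 is scored as binary verification of a note–concept pair, and Task 2 as set-based F1 over retrieved evidence sentences.

```python
# Hypothetical MedicalBench-style record (schema and content are assumptions,
# not the actual dataset format). The concept is implied, never stated verbatim.
example = {
    "concept": "acute kidney injury (ICD-10 N17.9)",
    "note_sentences": [
        "Creatinine rose from 0.9 to 2.4 over 48 hours.",          # implicit evidence
        "Patient was admitted for community-acquired pneumonia.",
        "Nephrology was consulted and IV fluids were started.",    # supporting evidence
    ],
    "label": True,            # Task 1: is the concept supported by the note?
    "evidence_ids": [0, 2],   # Task 2: which sentences ground the decision?
}

def verification_accuracy(gold_labels, pred_labels):
    """Task 1: note-concept verification, scored as simple accuracy."""
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return correct / len(gold_labels)

def evidence_f1(gold_ids, pred_ids):
    """Task 2: sentence-level evidence retrieval, scored as set F1."""
    gold, pred = set(gold_ids), set(pred_ids)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# A model that predicts sentences 0, 1, and 2 as evidence over-retrieves:
print(evidence_f1(example["evidence_ids"], [0, 1, 2]))  # 0.8
```

The set-F1 choice penalizes both over-retrieval (dragging precision down) and missed evidence (dragging recall down), which matches the paper's goal of assessing interpretability alongside correctness.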

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank  Journal                                                   Papers in training set  Percentile  Predicted probability
1     npj Digital Medicine                                      97                      Top 0.1%    23.6%
2     Journal of Biomedical Informatics                         45                      Top 0.1%    19.6%
3     Journal of the American Medical Informatics Association   61                      Top 0.4%    7.1%
----- 50% of probability mass above this point -----
4     Scientific Reports                                        3102                    Top 21%     5.1%
5     JCO Clinical Cancer Informatics                           18                      Top 0.2%    4.4%
6     The Lancet Digital Health                                 25                      Top 0.2%    3.0%
7     International Journal of Medical Informatics              25                      Top 0.6%    2.2%
8     Med                                                       38                      Top 0.1%    2.2%
9     Nature Medicine                                           117                     Top 1%      2.2%
10    JMIR Medical Informatics                                  17                      Top 0.6%    2.0%
11    BMC Medical Informatics and Decision Making               39                      Top 1%      2.0%
12    Bioinformatics                                            1061                    Top 7%      1.8%
13    JAMIA Open                                                37                      Top 0.8%    1.8%
14    Nature Communications                                     4913                    Top 50%     1.8%
15    PLOS Digital Health                                       91                      Top 2%      1.6%
16    PLOS ONE                                                  4510                    Top 59%     1.3%
17    Artificial Intelligence in Medicine                       15                      Top 0.5%    0.9%
18    Science Translational Medicine                            111                     Top 5%      0.8%
19    Frontiers in Digital Health                               20                      Top 1%      0.8%
20    Journal of Medical Internet Research                      85                      Top 4%      0.8%
21    iScience                                                  1063                    Top 31%     0.8%
22    Patterns                                                  70                      Top 3%      0.7%
23    Nature Biomedical Engineering                             42                      Top 2%      0.7%
24    Science Advances                                          1098                    Top 34%     0.5%
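As a quick arithmetic check on the 50% claim, the cumulative sum of the listed predicted probabilities first reaches 50% at rank 3 (23.6 + 19.6 + 7.1 = 50.3%). A minimal sketch, using only the percentages listed above:

```python
from itertools import accumulate

# Predicted probabilities (%) for the 24 ranked journals listed above.
probs = [23.6, 19.6, 7.1, 5.1, 4.4, 3.0, 2.2, 2.2, 2.2, 2.0, 2.0, 1.8,
         1.8, 1.8, 1.6, 1.3, 0.9, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.5]

cumulative = list(accumulate(probs))
# Smallest k such that the top-k journals cover at least 50% of the mass.
top_k = next(i + 1 for i, c in enumerate(cumulative) if c >= 50.0)
print(top_k, round(cumulative[top_k - 1], 1))  # 3 50.3
```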