
A Systematic Exploration of LLM Behavior for EHR Phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 · health informatics
medRxiv · DOI: 10.64898/2026.04.16.26350890

Background: Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility.

Methods: We evaluated LLM-based phenotyping pipelines on 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized (DeepSeek-R1) models were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy) and self-confidence calibration, and developed a taxonomy of recurrent model errors.

Results: Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40–0.90), yet the highest-performing configuration, whole-document inference without aggregation, was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. Prediction entropy was low overall and was driven primarily by phenotype difficulty rather than by prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence.

Conclusions: LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight on challenging phenotypes.
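To make two of the pipeline components concrete, here is a minimal Python sketch of the three aggregation rules named in the abstract (any-positive, two-vote, majority) and of a Shannon-entropy consistency measure over repeated predictions. The function names, signatures, and exact thresholds are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

# Illustrative sketch (not the authors' code): combining chunk-level
# phenotype calls into one document-level label under the three
# aggregation rules named in the abstract.
def aggregate(chunk_preds: list[bool], rule: str) -> bool:
    positives = sum(chunk_preds)
    if rule == "any-positive":   # positive if any chunk is positive
        return positives >= 1
    if rule == "two-vote":       # positive if at least two chunks agree
        return positives >= 2
    if rule == "majority":       # positive if more than half of chunks agree
        return positives > len(chunk_preds) / 2
    raise ValueError(f"unknown aggregation rule: {rule}")

# Shannon entropy (in bits) of repeated binary predictions for one note:
# 0.0 = perfectly consistent across runs, 1.0 = maximally inconsistent.
def prediction_entropy(repeated_preds: list[bool]) -> float:
    n = len(repeated_preds)
    counts = Counter(repeated_preds)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: one chunk flags the phenotype; only "any-positive" returns True.
chunks = [True, False, False, False]
print(aggregate(chunks, "any-positive"))  # True
print(aggregate(chunks, "two-vote"))      # False
print(prediction_entropy([True, True, True, False]))  # ~0.811 bits
```

Note that the paper's best configuration, whole-document inference without aggregation, corresponds to skipping chunking and aggregation entirely and prompting once per note.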

Matching journals

The top 2 journals account for just over 50% of the predicted probability mass (39.3% + 17.5% = 56.8%).

Rank  Probability  Percentile  Training papers  Journal
   1        39.3%  Top 0.1%    61               Journal of the American Medical Informatics Association
   2        17.5%  Top 0.3%    97               npj Digital Medicine
   3         4.3%  Top 0.3%    45               Journal of Biomedical Informatics
   4         4.3%  Top 0.3%    37               JAMIA Open
   5         3.6%  Top 0.7%    91               PLOS Digital Health
   6         3.1%  Top 41%     3102             Scientific Reports
   7         3.1%  Top 0.2%    25               The Lancet Digital Health
   8         2.1%  Top 0.4%    18               JCO Clinical Cancer Informatics
   9         1.8%  Top 2%      127              JAMA Network Open
  10         1.7%  Top 0.6%    43               BMC Medical Research Methodology
  11         1.3%  Top 0.4%    15               European Heart Journal - Digital Health
  12         1.2%  Top 0.6%    13               BMJ Health & Care Informatics
  13         1.2%  Top 2%      39               BMC Medical Informatics and Decision Making
  14         1.1%  Top 1%      17               JMIR Medical Informatics
  15         0.9%  Top 0.7%    27               Annals of Internal Medicine
  16         0.9%  Top 1%      20               Frontiers in Digital Health
  17         0.8%  Top 4%      85               Journal of Medical Internet Research
  18         0.8%  Top 29%     1063             iScience
  19         0.8%  Top 4%      120              Computers in Biology and Medicine
  20         0.7%  Top 5%      117              Nature Medicine
  21         0.7%  Top 0.3%    15               Inflammatory Bowel Diseases
  22         0.6%  Top 71%     4510             PLOS ONE