A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics
10.64898/2026.04.16.26350890 medRxiv

Background: Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility.

Methods: We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors.

Results: Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model x Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence.

Conclusions: LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
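
To make the aggregation rules and the consistency measure named in the abstract concrete, here is a minimal Python sketch. It is not the authors' code. The function names `aggregate` and `prediction_entropy` are hypothetical, and the rule definitions are assumptions inferred from the rule names: "any-positive" means at least one positive chunk, "two-vote" means at least two, and "majority" means strictly more than half. The abstract does not specify tie handling or the entropy base, so the strict majority and base 2 used here are also assumptions.

```python
import math
from collections import Counter


def aggregate(chunk_votes, rule):
    """Combine per-chunk binary predictions into one document-level label.

    chunk_votes: list of bools, True when a chunk was labeled positive.
    rule: "any-positive", "two-vote", or "majority" (assumed definitions).
    """
    positives = sum(chunk_votes)
    if rule == "any-positive":
        return positives >= 1
    if rule == "two-vote":
        return positives >= 2
    if rule == "majority":
        # Assumption: ties count as negative.
        return positives > len(chunk_votes) / 2
    raise ValueError(f"unknown aggregation rule: {rule}")


def prediction_entropy(labels):
    """Shannon entropy (in bits) of repeated predictions for one note.

    0.0 means every run gave the same label; higher values mean the
    model's output was less consistent across runs.
    """
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# One note split into three chunks, with a single positive chunk.
votes = [True, False, False]
for rule in ("any-positive", "two-vote", "majority"):
    print(rule, aggregate(votes, rule))

# Five repeated runs on the same note, four of which agree.
print(prediction_entropy([1, 1, 1, 1, 0]))
```

Under these assumed definitions, a single positive chunk flips the document label only with the any-positive rule, which illustrates why the choice of aggregation rule can shift the balance between precision and recall.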

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.1%, 40.1%
2. npj Digital Medicine: 97 papers in training set, Top 0.3%, 18.6%
(50% of probability mass above)
3. PLOS Digital Health: 91 papers in training set, Top 0.6%, 4.0%
4. Journal of Biomedical Informatics: 45 papers in training set, Top 0.4%, 3.9%
5. JAMIA Open: 37 papers in training set, Top 0.4%, 3.6%
6. Scientific Reports: 3102 papers in training set, Top 40%, 3.1%
7. The Lancet Digital Health: 25 papers in training set, Top 0.2%, 2.8%
8. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.4%, 1.9%
9. JAMA Network Open: 127 papers in training set, Top 2%, 1.7%
10. Frontiers in Digital Health: 20 papers in training set, Top 0.9%, 1.2%
11. Journal of Medical Internet Research: 85 papers in training set, Top 3%, 1.2%
12. BMC Medical Research Methodology: 43 papers in training set, Top 0.8%, 1.2%
13. BMJ Health & Care Informatics: 13 papers in training set, Top 0.6%, 1.1%
14. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 2%, 1.0%
15. JMIR Medical Informatics: 17 papers in training set, Top 1%, 1.0%
16. Annals of Internal Medicine: 27 papers in training set, Top 0.7%, 0.9%
17. iScience: 1063 papers in training set, Top 31%, 0.8%
18. Computers in Biology and Medicine: 120 papers in training set, Top 4%, 0.8%
19. Nature Medicine: 117 papers in training set, Top 5%, 0.7%
20. International Journal of Medical Informatics: 25 papers in training set, Top 2%, 0.7%
21. PLOS ONE: 4510 papers in training set, Top 70%, 0.7%
22. JMIR Public Health and Surveillance: 45 papers in training set, Top 5%, 0.5%
23. European Heart Journal - Digital Health: 15 papers in training set, Top 0.7%, 0.5%