
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics
10.64898/2026.04.23.26351098 medRxiv

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess the failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD).
Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology, which we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged their codes as required or optional. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct).
Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs.
Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures in which retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and with codelists containing a large number of codes. The failure modes we highlight can inform the design of future workflows that avoid these failures.
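
The evaluation design described in the abstract can be illustrated with a minimal Python sketch. It enumerates the 7 x 2 x 4 x 3 run grid and shows a simple set-based check for omitted required codes and irrelevant included codes. This is an assumption-laden illustration only: the abstract states that scoring was done by model-grading with another LLM, not by the set comparison below, and the codelist names, subset names, example codes, and the function `check_codelist` are hypothetical placeholders.

```python
from itertools import product

# Placeholder labels: the abstract gives the counts, not the full list of names.
codelists = [f"codelist_{i}" for i in range(1, 8)]   # 7 reference codelists
subsets = ["subset_a", "subset_b"]                   # 2 database subsets
models = ["Gemini 3 Pro", "Claude Sonnet 4.6", "Gemini 3 Flash", "OpenAI GPT 5.2"]
epochs = [1, 2, 3]                                   # 3 epochs each

# One evaluation per combination of codelist, subset, model, and epoch.
runs = list(product(codelists, subsets, models, epochs))
print(len(runs))  # 7 * 2 * 4 * 3 = 168


def check_codelist(output, required, optional):
    """Hypothetical set-based check (not the authors' model-grading).

    Reports required codes missing from the output and output codes
    that are neither required nor optional.
    """
    output = set(output)
    omitted = set(required) - output
    irrelevant = output - set(required) - set(optional)
    return {
        "omitted": sorted(omitted),
        "irrelevant": sorted(irrelevant),
        "complete": not omitted and not irrelevant,
    }


# Made-up codes: "C" is required but missing, "X" is irrelevant.
result = check_codelist(["A", "B", "X"], required={"A", "C"}, optional={"B"})
print(result)
```

A deterministic check like this only works when codes can be matched exactly. Model-grading, as used in the study, leaves the comparison against the reference codelist to a second LLM.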

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.3% | 9.9%
2 | Journal of Medical Internet Research | 85 papers in training set | Top 0.4% | 9.9%
3 | International Journal of Medical Informatics | 25 papers in training set | Top 0.1% | 8.3%
4 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.3% | 8.3%
5 | Journal of Biomedical Informatics | 45 papers in training set | Top 0.2% | 7.1%
6 | npj Digital Medicine | 97 papers in training set | Top 0.9% | 6.2%
7 | JAMIA Open | 37 papers in training set | Top 0.3% | 4.8%
50% of probability mass above
8 | Frontiers in Digital Health | 20 papers in training set | Top 0.1% | 4.8%
9 | Artificial Intelligence in Medicine | 15 papers in training set | Top 0.1% | 3.5%
10 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.3% | 3.5%
11 | JMIR Medical Informatics | 17 papers in training set | Top 0.3% | 3.5%
12 | BMJ Health & Care Informatics | 13 papers in training set | Top 0.2% | 3.0%
13 | Scientific Reports | 3102 papers in training set | Top 50% | 2.1%
14 | Biology Methods and Protocols | 53 papers in training set | Top 0.7% | 1.9%
15 | Computers in Biology and Medicine | 120 papers in training set | Top 2% | 1.7%
16 | PLOS ONE | 4510 papers in training set | Top 55% | 1.6%
17 | Bioinformatics | 1061 papers in training set | Top 7% | 1.6%
18 | BMC Medical Research Methodology | 43 papers in training set | Top 0.9% | 1.2%
19 | PLOS Digital Health | 91 papers in training set | Top 2% | 1.1%
20 | Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.8% | 0.9%
21 | Healthcare | 16 papers in training set | Top 2% | 0.8%
22 | Cureus | 67 papers in training set | Top 5% | 0.8%
23 | iScience | 1063 papers in training set | Top 30% | 0.8%
24 | Cancer Medicine | 24 papers in training set | Top 1% | 0.7%
25 | BMC Bioinformatics | 383 papers in training set | Top 7% | 0.7%
26 | Frontiers in Artificial Intelligence | 18 papers in training set | Top 1.0% | 0.6%