
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 · health informatics
medRxiv · doi:10.64898/2026.04.23.26351098

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess the failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD).

Materials & Methods: We set up a RAG system backed by a database of word embeddings of the medical terminology, created with a general-purpose embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged codes as required or optional. We ran 168 evaluations (7 codelists, 2 database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM given the reference codelists as context) to evaluate the output codelists, with a score of 0% meaning all codes incorrect and 100% all correct.

Results: Accuracy varied across models and codelists: Gemini 3 Pro (score 43%) generally outperformed Claude Sonnet 4.6 (36%) and Gemini 3 Flash, while OpenAI GPT 5.2 performed worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs.

Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we identify failure modes including hallucinations, retrieval failures, and generation failures in which retrieved codes are not used.

Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and with codelists containing many codes. The failure modes we highlight can inform the design of future workflows that avoid these failures.
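The retrieval step the abstract describes — a vector database of terminology embeddings queried for candidate codes before generation — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy character-trigram embedding stands in for a real model such as gemini-embedding, and the mini-terminology (codes F1, F2, E1, H1 and their terms) is invented for the example.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy deterministic embedding via character-trigram hashing.
    A stand-in for a real embedding model; illustrative only."""
    v = np.zeros(dim)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        v[hash(t[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Hypothetical mini-terminology: (code, term) pairs.
TERMINOLOGY = [
    ("F1", "Fracture of wrist"),
    ("F2", "Fracture of distal radius"),
    ("E1", "Eosinophilic esophagitis"),
    ("H1", "Hidradenitis suppurativa"),
]

# Precompute unit-norm embeddings for every term in the database.
TERM_VECS = np.stack([embed(term) for _, term in TERMINOLOGY])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k codes whose term embeddings have the highest
    cosine similarity to the query (dot product of unit vectors)."""
    sims = TERM_VECS @ embed(query)
    top = np.argsort(sims)[::-1][:k]
    return [TERMINOLOGY[i][0] for i in top]
```

In the paper's setup, the retrieved candidates would then be passed to the LLM as context for codelist generation; a retrieval failure corresponds to a required code never surfacing among these candidates, and a generation failure to a retrieved code being dropped from the final codelist.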

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Probability
1 | Journal of the American Medical Informatics Association | 61 | Top 0.2% | 12.2%
2 | Journal of Biomedical Informatics | 45 | Top 0.2% | 8.2%
3 | BMC Medical Informatics and Decision Making | 39 | Top 0.3% | 7.0%
4 | International Journal of Medical Informatics | 25 | Top 0.1% | 7.0%
5 | Journal of Medical Internet Research | 85 | Top 0.7% | 6.7%
6 | npj Digital Medicine | 97 | Top 0.8% | 6.2%
7 | JAMIA Open | 37 | Top 0.2% | 6.2%
--- 50% of probability mass above ---
8 | Frontiers in Digital Health | 20 | Top 0.1% | 4.7%
9 | Artificial Intelligence in Medicine | 15 | Top 0.1% | 3.9%
10 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 3.5%
11 | BMJ Health & Care Informatics | 13 | Top 0.2% | 3.5%
12 | JMIR Medical Informatics | 17 | Top 0.4% | 3.0%
13 | Scientific Reports | 3102 | Top 45% | 2.7%
14 | Biology Methods and Protocols | 53 | Top 0.6% | 2.0%
15 | Computers in Biology and Medicine | 120 | Top 2% | 1.7%
16 | PLOS ONE | 4510 | Top 56% | 1.6%
17 | PLOS Digital Health | 91 | Top 2% | 1.3%
18 | Bioinformatics | 1061 | Top 8% | 1.2%
19 | BMC Medical Research Methodology | 43 | Top 1% | 0.9%
20 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.9%
21 | Healthcare | 16 | Top 1% | 0.9%
22 | Computer Methods and Programs in Biomedicine | 27 | Top 0.9% | 0.8%
23 | Cureus | 67 | Top 5% | 0.7%
24 | iScience | 1063 | Top 34% | 0.7%
25 | Cancer Medicine | 24 | Top 2% | 0.7%
26 | BMC Bioinformatics | 383 | Top 8% | 0.6%
27 | Frontiers in Artificial Intelligence | 18 | Top 1.0% | 0.6%