
To RAG, or Not to RAG? A Comparative Evaluation of Retrieval-Augmented Generation for ICD Coding of German Tumor Diagnoses

Alickovic, F.; Lenz, S.; Ustjanzew, A.; Ortiz Rosario, L.; Vollmar, G. M.; Kindler, T.; Panholzer, T.

2026-06-03 health informatics
10.64898/2026.05.27.26353695 medRxiv

Introduction: Coding tumor diagnoses from free-text clinical documentation currently requires substantial manual effort. Promising approaches for automating this process include large language models (LLMs), embedding models, and retrieval-augmented generation (RAG). While previous studies often focus on a single method, we directly compare these approaches on a real-world dataset of tumor diagnosis descriptions to assess their strengths and limitations.

Methods: We evaluated nine different embedding models using similarity search and embedding-based classification, as well as LLM-based coding, with and without RAG, on a real-world dataset of 2,024 unique German tumor diagnosis descriptions labeled with ICD-10 and ICD-O topography codes. The retrieval knowledge base was constructed exclusively from standardized Alpha-ID, ICD-10-GM, and ICD-O-3 classifications. Performance was assessed for exact (full-code) and partial (three-character) code prediction. For RAG, we evaluated base and fine-tuned versions of Llama 3.1 8B and Llama 3.3 70B.

Results: Qwen3-Embedding-8B, the largest embedding model, yielded the best results. It achieved 47.8% exact-match and 72.1% partial-match accuracy for ICD-10 coding with classification, and 42.7% exact-match and 73.5% partial-match accuracy for ICD-O coding with similarity search. The other embedding models, including medically specialized ones, showed varied but lower performance. RAG improved base LLM performance and outperformed embedding-based approaches on partial-match accuracy (80.6% partial-match accuracy for ICD-10 and 75.0% for ICD-O with Llama 3.3 70B), but not on exact-match accuracy.

Conclusion: A direct comparison with embedding-based approaches is essential to determine whether the additional effort of RAG is justified. The strong variation in performance also highlights the importance of model selection. Further advances in embedding-based methods, potentially supported by larger and more diverse training data, may offer a promising direction for future work.
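
The abstract distinguishes exact (full-code) from partial (three-character) matches. As a minimal sketch of how the two accuracies could be computed, assuming that a partial match means agreement on the first three characters of the code (for example `C50` in `C50.4`): the function names and the sample codes below are hypothetical illustrations, not the authors' evaluation code or data.

```python
def exact_match(predicted: str, gold: str) -> bool:
    """True if the full codes agree (ignoring case and surrounding whitespace)."""
    return predicted.strip().upper() == gold.strip().upper()


def partial_match(predicted: str, gold: str, prefix_len: int = 3) -> bool:
    """True if the first `prefix_len` characters agree (assumed three-character category)."""
    p = predicted.strip().upper()[:prefix_len]
    g = gold.strip().upper()[:prefix_len]
    return len(g) == prefix_len and p == g


def accuracy(predictions, golds, match) -> float:
    """Fraction of prediction/gold pairs for which `match` returns True."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical example codes, for illustration only.
predictions = ["C50.4", "C50.9", "C34.1", "C18.7"]
golds = ["C50.4", "C50.4", "C34.1", "C61"]

exact_acc = accuracy(predictions, golds, exact_match)
partial_acc = accuracy(predictions, golds, partial_match)
print(f"exact-match accuracy:   {exact_acc:.2f}")
print(f"partial-match accuracy: {partial_acc:.2f}")
```

Under this definition every exact match is also a partial match, so partial-match accuracy can never fall below exact-match accuracy, which is consistent with the figures reported above.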

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1. Artificial Intelligence in Medicine (15 papers in training set; Top 0.1%; 8.5%)
2. Scientific Reports (3102 papers in training set; Top 12%; 7.2%)
3. BMC Medical Informatics and Decision Making (39 papers in training set; Top 0.4%; 6.4%)
4. JCO Clinical Cancer Informatics (18 papers in training set; Top 0.1%; 6.4%)
5. PLOS ONE (4510 papers in training set; Top 35%; 4.0%)
6. Journal of Medical Internet Research (85 papers in training set; Top 1%; 3.7%)
7. Biology Methods and Protocols (53 papers in training set; Top 0.3%; 3.6%)
8. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.8%; 3.6%)
9. Bioinformatics (1061 papers in training set; Top 5%; 3.6%)
10. JMIR Medical Informatics (17 papers in training set; Top 0.4%; 3.1%)

50% of probability mass above

11. Cancer Medicine (24 papers in training set; Top 0.4%; 2.7%)
12. BMC Bioinformatics (383 papers in training set; Top 3%; 2.6%)
13. International Journal of Medical Informatics (25 papers in training set; Top 0.5%; 2.6%)
14. Frontiers in Digital Health (20 papers in training set; Top 0.5%; 2.1%)
15. BMJ Health & Care Informatics (13 papers in training set; Top 0.3%; 2.1%)
16. Journal of Biomedical Informatics (45 papers in training set; Top 0.7%; 1.9%)
17. Journal of Personalized Medicine (28 papers in training set; Top 0.3%; 1.7%)
18. Frontiers in Artificial Intelligence (18 papers in training set; Top 0.3%; 1.7%)
19. Healthcare (16 papers in training set; Top 0.6%; 1.7%)
20. BMC Medical Research Methodology (43 papers in training set; Top 0.7%; 1.5%)
21. BMJ Open (554 papers in training set; Top 10%; 1.5%)
22. eBioMedicine (130 papers in training set; Top 2%; 1.3%)
23. iScience (1063 papers in training set; Top 21%; 1.2%)
24. The Lancet Digital Health (25 papers in training set; Top 0.7%; 1.1%)
25. Scientific Data (174 papers in training set; Top 2%; 1.1%)
26. Diagnostics (48 papers in training set; Top 2%; 1.0%)
27. Bioinformatics Advances (184 papers in training set; Top 4%; 0.9%)
28. Data in Brief (13 papers in training set; Top 0.3%; 0.9%)
29. Computer Methods and Programs in Biomedicine (27 papers in training set; Top 0.7%; 0.9%)
30. Computers in Biology and Medicine (120 papers in training set; Top 4%; 0.8%)