
OncoRAG: Graph-Based Retrieval Enabling Clinical Phenotyping from Oncology Notes Using Local Mid-Size Language Models

Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.

medRxiv preprint (oncology), 2026-03-06. DOI: 10.64898/2026.03.05.26347717

Introduction: Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller, locally deployed models using a disease-site-specific pipeline and prompt configuration that are optimized and reusable.

Materials/Methods: We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a mid-size language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; n=104 patients, 42 features; primary development), recurrent high-grade glioma (RiCi; n=191 patients, 19 features; cross-lingual validation in German), and MIMIC-IV (n=100 patients, 10 features; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features.

Results: The pipeline achieved mean F1 scores of 0.80 ± 0.07 (TNBC; n=44 patients, 42 features), 0.79 ± 0.12 (RiCi; n=61 patients, 19 features), and 0.84 ± 0.06 (MIMIC-IV; n=100 patients, 10 features) on test sets under the automatic configuration. Compared with direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1 score by 0.19-0.22 and 0.17-0.19, respectively. Manual configuration refinement further improved the F1 score to 0.83 (TNBC) and 0.81 (RiCi), with no change for MIMIC-IV. Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, at a cost of 0.03-0.10 in F1 score. For TNBC, extraction time fell from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a C-index comparable to those built on manually curated features (0.77 vs 0.76; 12 events).

Conclusions: OncoRAG, deployed locally using a mid-size language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.

[Graphical abstract figure available in the preprint.]
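The abstract describes phase (3) as retrieving relevant context via "graph-diffusion reranking" but does not specify the algorithm. A minimal personalized-PageRank-style sketch conveys the general idea: initial retrieval scores seed a relevance vector that is diffused over the graph, so chunks connected to multiple relevant chunks are promoted. All graph structure, scores, and parameter values below are hypothetical, not OncoRAG's actual implementation.

```python
# Hedged sketch of graph-diffusion reranking (personalized-PageRank style).
# The graph, seed scores, and damping factor are illustrative only; the
# abstract does not specify OncoRAG's actual diffusion algorithm.

def diffuse_rank(edges, seeds, alpha=0.85, iters=50):
    """Rerank nodes by diffusing seed relevance over an undirected graph.

    edges: dict node -> list of neighbor nodes
    seeds: dict node -> initial retrieval score (e.g. embedding similarity)
    alpha: fraction of mass that flows along edges each step
    """
    nodes = list(edges)
    total = sum(seeds.get(n, 0.0) for n in nodes) or 1.0
    # Normalized restart vector from the initial retrieval scores.
    s = {n: seeds.get(n, 0.0) / total for n in nodes}
    x = dict(s)
    for _ in range(iters):
        nxt = {}
        for n in nodes:
            # Mass flowing in from neighbors, split evenly over each
            # neighbor's degree.
            inflow = sum(x[m] / len(edges[m]) for m in edges[n] if edges[m])
            nxt[n] = (1 - alpha) * s[n] + alpha * inflow
        x = nxt
    return sorted(nodes, key=x.get, reverse=True)

# Toy example: chunk "C" has no seed score but links two strongly seeded
# chunks, so diffusion ranks it above the weakly seeded, isolated chunk "D".
graph = {"A": ["C"], "B": ["C"], "C": ["A", "B"], "D": []}
scores = {"A": 1.0, "B": 0.9, "D": 0.2}
ranking = diffuse_rank(graph, scores)
```

The toy example shows why diffusion can beat score-only ranking: chunk "C" receives no direct retrieval score, yet ends up ranked first because relevance flows into it from both seeded neighbors.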

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | JCO Clinical Cancer Informatics | 18 | Top 0.1% | 33.2% |
| 2 | Frontiers in Oncology | 95 | Top 0.2% | 10.2% |
| 3 | Artificial Intelligence in Medicine | 15 | Top 0.1% | 6.4% |
| 4 | npj Digital Medicine | 97 | Top 1% | 4.0% |
| 5 | JAMA Network Open | 127 | Top 0.9% | 3.6% |
| 6 | Journal of Translational Medicine | 46 | Top 0.3% | 2.8% |
| 7 | PLOS ONE | 4510 | Top 45% | 2.6% |
| 8 | Cancer Medicine | 24 | Top 0.5% | 2.1% |
| 9 | European Journal of Cancer | 10 | Top 0.1% | 1.9% |
| 10 | Scientific Reports | 3102 | Top 55% | 1.8% |
| 11 | British Journal of Cancer | 42 | Top 0.9% | 1.7% |
| 12 | iScience | 1063 | Top 15% | 1.7% |
| 13 | Database | 51 | Top 0.5% | 1.5% |
| 14 | npj Precision Oncology | 48 | Top 0.7% | 1.3% |
| 15 | Journal of Medical Internet Research | 85 | Top 3% | 1.2% |
| 16 | Clinical Cancer Research | 58 | Top 1% | 1.2% |
| 17 | BMC Bioinformatics | 383 | Top 6% | 1.1% |
| 18 | Annals of Oncology | 13 | Top 0.7% | 1.0% |
| 19 | Biology Methods and Protocols | 53 | Top 2% | 1.0% |
| 20 | Computers in Biology and Medicine | 120 | Top 4% | 0.9% |
| 21 | JCO Precision Oncology | 14 | Top 0.3% | 0.9% |
| 22 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.7% |
| 23 | BMC Cancer | 52 | Top 3% | 0.7% |
| 24 | BMJ Open | 554 | Top 13% | 0.7% |
| 25 | PLOS Computational Biology | 1633 | Top 27% | 0.6% |
| 26 | JMIR Medical Informatics | 17 | Top 2% | 0.5% |
| 27 | Cancers | 200 | Top 6% | 0.5% |
| 28 | BMC Medicine | 163 | Top 9% | 0.5% |