OncoRAG: Graph-Based Retrieval Enabling Clinical Phenotyping from Oncology Notes Using Local Mid-Size Language Models
Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.
Introduction
Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller, locally deployed models using a disease-site-specific pipeline and prompt configuration that are optimized and reusable.

Materials/Methods
We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a mid-size language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; n_patients = 104, n_features = 42; primary development), recurrent high-grade glioma (RiCi; n_patients = 191, n_features = 19; cross-lingual validation in German), and MIMIC-IV (n_patients = 100, n_features = 10; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features.

Results
The pipeline achieved mean F1 scores of 0.80 ± 0.07 (TNBC; n_patients = 44, n_features = 42), 0.79 ± 0.12 (RiCi; n_patients = 61, n_features = 19), and 0.84 ± 0.06 (MIMIC-IV; n_patients = 100, n_features = 10) on test sets under the automatic configuration. Compared with direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1 score by 0.19-0.22 and 0.17-0.19, respectively. Manual configuration refinement further improved the F1 score to 0.83 (TNBC) and 0.81 (RiCi), with no change in MIMIC-IV. Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, with a decrease in F1 score of 0.03-0.10. For TNBC, extraction time was reduced from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a C-index comparable to those built from manually curated features (0.77 vs 0.76; 12 events).

Conclusions
OncoRAG, deployed locally using a mid-size language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.

Graphical abstract (figure not reproduced).
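To make the described workflow concrete, the sketch below illustrates the general pattern of graph-based retrieval followed by structured prompting of a locally served model. It is not the authors' implementation: personalized PageRank on a networkx graph is used here only as a stand-in for the paper's graph-diffusion reranking, the graph contents and prompt are toy placeholders, and the Ollama model tag phi3:14b is assumed.

```python
# Minimal sketch: graph-based retrieval + local structured extraction.
# NOT the OncoRAG implementation; personalized PageRank stands in for the
# paper's graph-diffusion reranking, and all data here are toy examples.
import networkx as nx
import requests

# Toy clinical knowledge graph: note chunks linked to recognized entities.
G = nx.Graph()
G.add_edge("chunk_1", "ER-negative")
G.add_edge("chunk_1", "chemotherapy")
G.add_edge("chunk_2", "ER-negative")
G.add_edge("chunk_3", "hypertension")

# Feature-specific search terms (phase 1) seed the diffusion (phase 3).
search_terms = ["ER-negative"]
personalization = {t: 1.0 for t in search_terms if t in G}
scores = nx.pagerank(G, alpha=0.85, personalization=personalization)

# Keep the highest-scoring note chunks as the retrieved context.
top_chunks = sorted((n for n in G if n.startswith("chunk_")),
                    key=lambda n: scores[n], reverse=True)[:2]
context = "\n".join(top_chunks)  # in practice, the chunk text itself

# Phase 4: structured extraction with a locally served model via Ollama.
prompt = (
    "Using only the context below, report the estrogen receptor status "
    'as JSON {"er_status": "positive" | "negative" | "unknown"}.\n'
    f"Context:\n{context}"
)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3:14b",  # assumed tag; any locally pulled model works
          "prompt": prompt, "stream": False, "format": "json"},
    timeout=120,
)
print(resp.json()["response"])
```

In the actual pipeline, the retrieved context would be the note passages themselves and the prompt would come from the optimized, feature-specific configuration described above; the sketch only shows how the retrieval and extraction steps fit together.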