Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support
Khan, U. A.
Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patient's specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variant's text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). In the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%), but its conflation rate rose to 100% with added clinical context. At the more stringent threshold τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%.
The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.
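The conflation metric described in the abstract can be illustrated with a minimal sketch. It assumes a pair counts as conflated when its cosine similarity is at or above a threshold tau, and it uses hand-made toy vectors rather than real encoder outputs. The names `conflation_rate`, `toy_embeddings`, and `typed_lookup`, and the node identifiers, are hypothetical and not taken from the study; the typed lookup only stands in for the idea of one discrete node per variant.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def conflation_rate(pairs, embeddings, tau):
    """Fraction of variant pairs whose cosine similarity is >= tau."""
    conflated = sum(
        cosine_similarity(embeddings[x], embeddings[y]) >= tau
        for x, y in pairs
    )
    return conflated / len(pairs)


# Toy vectors for illustration only; these are NOT real model embeddings.
toy_embeddings = {
    "BRAF V600E": [1.0, 0.0, 0.0],
    "BRAF V600K": [1.0, 0.1, 0.0],
    "KRAS G12C": [0.0, 1.0, 0.0],
    "KRAS G12D": [0.0, 0.6, 0.8],
}
toy_pairs = [
    ("BRAF V600E", "BRAF V600K"),
    ("KRAS G12C", "KRAS G12D"),
]

# Share of toy pairs that a similarity threshold of 0.95 cannot separate.
rate = conflation_rate(toy_pairs, toy_embeddings, tau=0.95)

# Hypothetical typed lookup: each (gene, variant) key maps to its own node,
# so two distinct variants can never resolve to the same entry.
typed_nodes = {
    ("BRAF", "V600E"): "node:BRAF:V600E",
    ("BRAF", "V600K"): "node:BRAF:V600K",
}


def typed_lookup(gene, variant):
    """Exact-match retrieval; returns None when the variant is unknown."""
    return typed_nodes.get((gene, variant))
```

In this toy setup the first pair has near-parallel vectors and the second does not, so the two pairs fall on opposite sides of the threshold. The exact-key lookup returns a different node for V600E and V600K regardless of how similar their text descriptions are.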