
Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support

Khan, U. A.

2026-05-11 bioinformatics
10.64898/2026.05.05.723102 bioRxiv

Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patient's specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variant's text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). In the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%), but its conflation rate rose to 100% with added clinical context. At the more stringent threshold τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%. The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.
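The conflation metric described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code: the function names (`cosine_similarity`, `conflation_rate`, `lookup_therapies`), the two-dimensional toy vectors, and the placeholder therapy label are all assumptions made for illustration, and no real embedding model is called. The sketch assumes a pair counts as conflated when its cosine similarity is at or above the threshold τ, and it contrasts that with a typed lookup keyed on exact (gene, variant) identity.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def conflation_rate(
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]], tau: float
) -> float:
    """Fraction of pairs whose cosine similarity is >= tau (assumed definition)."""
    conflated = sum(1 for a, b in pairs if cosine_similarity(a, b) >= tau)
    return conflated / len(pairs)


# Hypothetical toy "embeddings" standing in for variant text vectors.
# Real encoder outputs would have hundreds of dimensions.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly parallel vectors
    ([1.0, 0.0], [1.0, 0.3]),  # close, but less so
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal vectors
]

# Typed lookup: each variant is a discrete key, so two distinct variants
# can never be returned for one another. The therapy label is a placeholder.
VariantKey = Tuple[str, str]
therapy_graph: Dict[VariantKey, List[str]] = {
    ("KRAS", "G12C"): ["hypothetical_targeted_therapy"],
    ("KRAS", "G12D"): [],
}


def lookup_therapies(gene: str, variant: str) -> List[str]:
    """Exact-match lookup; unknown variants return an empty list."""
    return therapy_graph.get((gene, variant), [])


if __name__ == "__main__":
    print(conflation_rate(toy_pairs, 0.95))
    print(conflation_rate(toy_pairs, 0.99))
    print(lookup_therapies("KRAS", "G12C"), lookup_therapies("KRAS", "G12D"))
```

With these toy vectors, two of three pairs exceed τ = 0.95 and one of three exceeds τ = 0.99, while the keyed lookup keeps KRAS G12C and G12D separate regardless of how similar their text descriptions are.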

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Bioinformatics | 1061 | Top 3% | 10.5%
2 | JCO Clinical Cancer Informatics | 18 | Top 0.1% | 10.1%
3 | Bioinformatics Advances | 184 | Top 0.3% | 6.9%
4 | PLOS Computational Biology | 1633 | Top 5% | 6.9%
5 | Journal of the American Medical Informatics Association | 61 | Top 0.6% | 4.4%
6 | Cell Systems | 167 | Top 3% | 4.3%
7 | BMC Bioinformatics | 383 | Top 3% | 3.6%
8 | Genome Medicine | 154 | Top 2% | 3.6%
(50% of probability mass above)
9 | npj Digital Medicine | 97 | Top 1% | 3.6%
10 | Scientific Reports | 3102 | Top 41% | 3.1%
11 | Nature Communications | 4913 | Top 44% | 2.7%
12 | PLOS ONE | 4510 | Top 43% | 2.7%
13 | BioData Mining | 15 | Top 0.2% | 2.4%
14 | iScience | 1063 | Top 11% | 1.9%
15 | Nature Methods | 336 | Top 4% | 1.8%
16 | Proceedings of the National Academy of Sciences | 2130 | Top 32% | 1.7%
17 | npj Precision Oncology | 48 | Top 0.6% | 1.7%
18 | Nature Machine Intelligence | 61 | Top 2% | 1.5%
19 | Nucleic Acids Research | 1128 | Top 13% | 1.3%
20 | The Lancet Digital Health | 25 | Top 0.7% | 1.2%
21 | NAR Genomics and Bioinformatics | 214 | Top 3% | 0.9%
22 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 0.9%
23 | Frontiers in Genetics | 197 | Top 9% | 0.8%
24 | GigaScience | 172 | Top 3% | 0.8%
25 | Nature Medicine | 117 | Top 5% | 0.8%
26 | Nature Biotechnology | 147 | Top 8% | 0.8%
27 | Cancer Research | 116 | Top 4% | 0.7%
28 | Patterns | 70 | Top 3% | 0.6%
29 | Genome Biology | 555 | Top 9% | 0.6%
30 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.8% | 0.6%