
Beyond Identifier Matching: An Empirical Characterization of Failure Modes in Biomedical Knowledge Graph Integration

Hu, S.; Cheng, H.; Gillenwater, L.; Manpearl, K.; Mandava, A.; Wang, Y.; Pividori, M.; Stranger, B.; Krishnan, A.; Greene, C.; Gao, Y.

2026-05-28 health informatics
10.64898/2026.05.26.26354182 medRxiv

Objective. Biomedical knowledge graphs (KGs) such as PrimeKG, Hetionet, UMLS, and PharmGKB are increasingly used as the substrate for downstream machine-learning, retrieval-augmented generation, drug-repurposing, and electronic health record (EHR) augmentation pipelines. The dominant assumption in published work is that integrating two or more such KGs is a tractable engineering step solved by identifier (ID) matching. This paper interrogates that assumption empirically. We quantify how much concept overlap survives realistic alignment, and we characterize the new failure modes introduced by the methods that practitioners reach for when ID matching is insufficient.

Materials and Methods. We compared four widely used biomedical KGs (PrimeKG, Hetionet v1.0, the full UMLS Metathesaurus, and PharmGKB) across eleven node types using a tiered alignment pipeline: (1) direct ID matching for nodes sharing a primary vocabulary; (2) cross-ontology bridging using standard mappings (e.g., MONDO-DOID, HPO-UMLS, HPO-UMLS-MeSH for side effects, NCBI Gene-HGNC-UMLS, UBERON-FMA/SNOMEDCT_US/NCI/MeSH for anatomy); (3) ClinicalBERT cosine-similarity grouping at threshold >= 0.98 for over-segmented disease nodes, with a deterministic suffix-stripping canonicalizer; (4) exact name matching for ontology-poor types (anatomy, REACTOME pathways); and (5) embedding-based fuzzy matching with UMLS lookup (SapBERT and ClinicalBERT) for free-text microbiome concepts. We applied the pipeline to a 698-concept gut-microbiome benchmark spanning taxa, pathways, and disease labels, validated grouping decisions against the curated SSSOM mappings released by the MONDO project, and audited the ClinicalBERT consolidation against five clinical-genetics case studies drawn from the literature.

Results. Per-type pairwise coverage was strikingly asymmetric. Genes/proteins and the three Gene Ontology categories aligned cleanly across PrimeKG and Hetionet (mutual coverage 94-99%), but disease overlap was sparse: only 0.7% of PrimeKG individual disease nodes mapped to Hetionet, rising to 2.0% after MONDO grouping (versus 78.7% and 18.4% from the Hetionet side). PrimeKG-to-UMLS coverage spanned 100% (effect/phenotype via HPO) down to 20.8% (REACTOME pathways), with drugs at 73.7% and anatomy at 58.8%. PrimeKG-to-PharmGKB drug coverage required up to two bridging hops (DrugBank -> UMLS -> RxNorm/ATC/MeSH). Bigger was not uniformly more complete: on a 698-concept microbiome drug benchmark, Hetionet missed 0 concepts while PrimeKG missed 16. ClinicalBERT-based grouping consolidated 22,205 raw MONDO disease nodes into 17,080 groups but introduced three reproducible failure modes documented in case studies: (i) peer over-merging: for example, all 22 osteogenesis imperfecta subtypes collapsed into a single node despite distinct severity classes; (ii) parent-child collapse: e.g., acute myeloid leukemia merged with myeloid leukemia, erasing the acute/chronic distinction that drives clinical management; and (iii) lexical false positives: neurofibromatosis and schwannomatosis grouped together despite cellular-pathology differences.

Discussion. Identifier matching alone is a weak baseline for biomedical KG integration. Cross-ontology bridges and embedding-based consolidation expand coverage but do so at the cost of clinically meaningful resolution, and the resulting failures are systematic rather than random. Reporting only aggregate coverage statistics obscures these losses, which propagate silently into downstream tasks.

Conclusion. We provide reusable per-type coverage tables, a taxonomy of three integration failure modes, and concrete recommendations for downstream studies that depend on a unified biomedical KG. We argue that future KG integration work should report per-type coverage and per-cluster confidence rather than aggregate match rates.
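
Two ideas in the abstract lend themselves to a small illustration: directional (asymmetric) coverage between two KGs, and grouping nodes whose embeddings exceed a cosine-similarity threshold. The Python sketch below is not the authors' pipeline. The function names, the toy identifiers, and the two-dimensional toy vectors are hypothetical, and the single-linkage (union-find) grouping rule is an assumption, since the abstract specifies only the 0.98 threshold and not the linkage strategy.

```python
from itertools import combinations
from math import sqrt


def directional_coverage(source_ids, target_ids):
    """Fraction of source concepts that also appear in the target set."""
    source, target = set(source_ids), set(target_ids)
    return len(source & target) / len(source) if source else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors, threshold=0.98):
    """Assumed single-linkage rule: merge any pair with cosine >= threshold."""
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical identifier sets: coverage depends on which side is the source.
kg_a = {"D1", "D2", "D3", "D4"}
kg_b = {"D1", "X9"}
cov_ab = directional_coverage(kg_a, kg_b)  # 1 of 4 -> 0.25
cov_ba = directional_coverage(kg_b, kg_a)  # 1 of 2 -> 0.5

# Toy 2-D vectors standing in for real embeddings (illustrative values only).
toy_vectors = {
    "osteogenesis imperfecta type 1": [1.0, 0.00],
    "osteogenesis imperfecta type 2": [1.0, 0.05],
    "acute myeloid leukemia": [0.0, 1.0],
}
groups = group_by_similarity(toy_vectors)
print(cov_ab, cov_ba)
print(groups)
```

In this toy run the two osteogenesis imperfecta subtypes fall into one group because their vectors are nearly parallel, which mirrors the peer over-merging pattern the abstract describes. Under a single-linkage rule, chains of near neighbors can also pull distant nodes together, so reporting a per-cluster confidence (for example, the minimum pairwise similarity inside each group) is one way to expose such merges.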

Matching journals

The top 14 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 4% | 6.5%
2 | Nature Communications | 4913 papers in training set | Top 28% | 6.4%
3 | Med | 38 papers in training set | Top 0.1% | 4.4%
4 | Scientific Reports | 3102 papers in training set | Top 33% | 3.7%
5 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.2% | 3.7%
6 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.6% | 3.7%
7 | eBioMedicine | 130 papers in training set | Top 0.3% | 3.7%
8 | Nature Medicine | 117 papers in training set | Top 0.8% | 3.7%
9 | BMC Bioinformatics | 383 papers in training set | Top 3% | 3.3%
10 | PLOS ONE | 4510 papers in training set | Top 43% | 2.8%
11 | GENETICS | 189 papers in training set | Top 0.3% | 2.8%
12 | Nature Biotechnology | 147 papers in training set | Top 4% | 2.4%
13 | Science Translational Medicine | 111 papers in training set | Top 2% | 2.1%
14 | The Lancet Digital Health | 25 papers in training set | Top 0.2% | 2.1%
50% of probability mass above
15 | Computers in Biology and Medicine | 120 papers in training set | Top 1% | 2.1%
16 | Patterns | 70 papers in training set | Top 0.7% | 1.8%
17 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.7%
18 | PLOS Computational Biology | 1633 papers in training set | Top 16% | 1.7%
19 | Cell Reports Medicine | 140 papers in training set | Top 3% | 1.7%
20 | npj Digital Medicine | 97 papers in training set | Top 2% | 1.7%
21 | Bioinformatics Advances | 184 papers in training set | Top 3% | 1.5%
22 | Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.5%
23 | Scientific Data | 174 papers in training set | Top 1% | 1.5%
24 | Nucleic Acids Research | 1128 papers in training set | Top 13% | 1.2%
25 | BMC Medical Genomics | 36 papers in training set | Top 0.9% | 0.9%
26 | iScience | 1063 papers in training set | Top 26% | 0.9%
27 | Life Science Alliance | 263 papers in training set | Top 1% | 0.8%
28 | Communications Medicine | 85 papers in training set | Top 0.9% | 0.8%
29 | PLOS Digital Health | 91 papers in training set | Top 3% | 0.8%
30 | Molecular Systems Biology | 142 papers in training set | Top 2% | 0.8%