
CROssBARv2: A Unified Computational Framework for Heterogeneous Biomedical Data Representation and LLM-Driven Exploration

Sen, B.; Ulusoy, E.; Darcan, M.; Ergun, M.; Lobentanzer, S.; Rifaioglu, A. S.; Turei, D.; Saez-Rodriguez, J.; Dogan, T.

2026-04-15 bioinformatics
10.64898/2026.04.12.718028 bioRxiv

Biomedical discovery is hindered by fragmented, modality-specific repositories and uneven metadata, limiting integrative analysis, accessibility, and reproducibility. To address these challenges, we present CROssBARv2, a provenance-rich biomedical data-and-knowledge integration platform that unifies heterogeneous sources into a maintainable, scalable system. By consolidating diverse data types into an extensive knowledge graph enriched with standardised ontologies, rich metadata, and deep learning-based vector embeddings, CROssBARv2 alleviates the need for researchers to navigate multiple siloed databases and can facilitate downstream tasks, including predictive modelling and mechanistic reasoning, enabling applications such as drug repurposing and protein function prediction. The platform offers interactive graph exploration and embedding-based semantic search with CROssBAR-LLM, an intuitive natural language question-answering system that grounds large language model (LLM) outputs in the underlying knowledge graph to mitigate hallucinations. We assess CROssBARv2 through (i) multiple use-case analyses to test biological coherence and relational validity; (ii) knowledge-augmented biomedical question-answering benchmarks comparing CROssBAR-LLM against generalist LLMs; and (iii) a deep learning-based predictive modelling experiment for protein function prediction leveraging the heterogeneous structure of CROssBARv2. Collectively, CROssBARv2 provides a scalable, AI-ready, and user-friendly foundation that facilitates hypothesis generation, knowledge discovery, and translational research.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Nature Methods | 336 papers in training set | Top 0.7% | 12.5%
2. Bioinformatics | 1061 papers in training set | Top 3% | 10.1%
3. Nucleic Acids Research | 1128 papers in training set | Top 2% | 7.2%
4. Nature Biotechnology | 147 papers in training set | Top 1% | 6.8%
5. Bioinformatics Advances | 184 papers in training set | Top 0.9% | 4.3%
6. Nature | 575 papers in training set | Top 6% | 4.0%
7. Genome Medicine | 154 papers in training set | Top 2% | 4.0%
8. Advanced Science | 249 papers in training set | Top 5% | 3.6%
50% of probability mass above
9. Nature Communications | 4913 papers in training set | Top 40% | 3.6%
10. Cell Systems | 167 papers in training set | Top 4% | 3.6%
11. Patterns | 70 papers in training set | Top 0.2% | 3.6%
12. npj Digital Medicine | 97 papers in training set | Top 2% | 2.4%
13. Genome Biology | 555 papers in training set | Top 3% | 2.1%
14. Nature Machine Intelligence | 61 papers in training set | Top 2% | 1.9%
15. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 31% | 1.8%
16. GigaScience | 172 papers in training set | Top 2% | 1.5%
17. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.3%
18. PLOS ONE | 4510 papers in training set | Top 58% | 1.3%
19. Cell Genomics | 162 papers in training set | Top 4% | 1.2%
20. Database | 51 papers in training set | Top 0.6% | 1.2%
21. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.2%
22. Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.2%
23. PLOS Computational Biology | 1633 papers in training set | Top 21% | 0.9%
24. Nature Computational Science | 50 papers in training set | Top 1% | 0.9%
25. Nature Medicine | 117 papers in training set | Top 4% | 0.8%
26. Scientific Reports | 3102 papers in training set | Top 73% | 0.8%
27. Nature Genetics | 240 papers in training set | Top 7% | 0.7%
28. BMC Bioinformatics | 383 papers in training set | Top 8% | 0.6%
29. eLife | 5422 papers in training set | Top 61% | 0.6%
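
The "top 8 journals account for 50%" statement can be checked against the listed percentages. The page does not say how the cutoff is computed, so the following is only a minimal sketch, assuming the cutoff is the smallest rank at which the running sum of predicted probabilities reaches 50%. The function name `smallest_top_k` is hypothetical, and the inputs are the rounded percentages displayed in the list, so the sum is approximate.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-29, copied from the list above.
# These are the rounded values shown on the page, not the model's raw outputs.
probabilities = [
    12.5, 10.1, 7.2, 6.8, 4.3, 4.0, 4.0, 3.6, 3.6, 3.6,
    3.6, 2.4, 2.1, 1.9, 1.8, 1.5, 1.3, 1.3, 1.2, 1.2,
    1.2, 1.2, 0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.6,
]


def smallest_top_k(probs, threshold):
    """Return (k, cumulative) for the smallest k whose top-k sum reaches threshold.

    Assumes probs is already sorted in descending order (rank order).
    Returns (None, total) if the threshold is never reached.
    """
    for k, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return k, total
    return None, sum(probs)


k, mass = smallest_top_k(probabilities, 50.0)
print(f"top {k} journals cover {mass:.1f}% of the probability mass")
```

With the displayed values, the running sum is about 48.9% after seven journals and about 52.5% after eight, which agrees with the cutoff shown in the ranking.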