Bio-BLIP: A Multimodal Architecture for Transferable Reasoning in Genomic Variant Interpretation

Gupta, A.; Buendia, A.; Kundaje, A.; Leskovec, J.

2026-05-15 genomics
10.64898/2026.05.12.724740 bioRxiv

Developing scientific hypotheses in biology requires integrating heterogeneous evidence across DNA sequence, gene context, protein function, and prior literature. Existing multimodal AI systems expose biological evidence to reasoning models through textification or by projecting biological embeddings into fine-tuned language models. However, these models are typically highly optimized for the specific set of tasks on which they are fine-tuned. Here we present Bio-BLIP, a multimodal Q-former-based architecture that leverages biological embeddings and an LLM to generalize to complex reasoning tasks without task-specific fine-tuning. The key to Bio-BLIP is a new neural network architecture that integrates four data modalities (DNA, genes, proteins, and text) through a master Q-former model, which combines the modality-specific information into a fixed-length prefix for the LLM backbone. Bio-BLIP is pretrained on the task of human genetic variant annotation and achieves a 29.8% increase in generating accurate variant features over frontier LLMs. We evaluate Bio-BLIP zero-shot on the downstream genomic tasks of variant prioritization and target gene prediction. Bio-BLIP outperforms two alignment-free genomic language models on regulatory variant prioritization for Mendelian disease. On the target gene prediction task, Bio-BLIP improves accuracy over LLMs by leveraging learned genomic variant knowledge in difficult cases. Our model produces rich, transparent reasoning traces. In biological domains characterized by multiple scales of data and varied downstream tasks, Bio-BLIP offers a step toward natively multimodal, generalizable reasoning.
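The fixed-length prefix idea in the abstract can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: it assumes a single cross-attention step without learned key/value projections, random vectors standing in for the outputs of the modality encoders, and hypothetical sizes (`DIM`, `NUM_QUERIES`). It shows only that a fixed set of query vectors yields a prefix of the same length however many tokens each modality contributes.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product attention from queries to tokens.

    Returns one vector per query, so the output length depends only on
    the number of queries, not on the number of input tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


def build_prefix(modality_embeddings, queries):
    """Pool the tokens of all modalities and compress them into a prefix."""
    tokens = [
        tok for name in sorted(modality_embeddings) for tok in modality_embeddings[name]
    ]
    return cross_attend(queries, tokens)


def random_vectors(rng, count, dim):
    """Placeholder embeddings; a real system would use encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes chosen for illustration
queries = random_vectors(rng, NUM_QUERIES, DIM)  # stand-in for learned queries

# Two inputs with very different token counts per modality.
example_a = {
    "dna": random_vectors(rng, 10, DIM),
    "gene": random_vectors(rng, 3, DIM),
    "protein": random_vectors(rng, 5, DIM),
    "text": random_vectors(rng, 7, DIM),
}
example_b = {
    "dna": random_vectors(rng, 40, DIM),
    "gene": random_vectors(rng, 1, DIM),
    "protein": random_vectors(rng, 12, DIM),
    "text": random_vectors(rng, 2, DIM),
}

prefix_a = build_prefix(example_a, queries)
prefix_b = build_prefix(example_b, queries)
print(len(prefix_a), len(prefix_b))  # prints "4 4"
```

Both inputs produce a prefix of `NUM_QUERIES` vectors, which is what would let a language model backbone consume a constant number of prefix positions regardless of input size.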

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Nature Machine Intelligence (61 papers in training set; Top 0.1%): 12.3%
2. Nature Methods (336 papers in training set; Top 1%): 10.0%
3. Science (429 papers in training set; Top 4%): 8.3%
4. Nature (575 papers in training set; Top 4%): 8.3%
5. Nature Biotechnology (147 papers in training set; Top 1%): 6.3%
6. Genome Biology (555 papers in training set; Top 2%): 4.2%
7. Cell Genomics (162 papers in training set; Top 1%): 3.9%

50% of probability mass above

8. Bioinformatics (1061 papers in training set; Top 5%): 3.9%
9. Nature Communications (4913 papers in training set; Top 38%): 3.8%
10. Nature Medicine (117 papers in training set; Top 1%): 2.7%
11. Genome Research (409 papers in training set; Top 2%): 2.4%
12. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 26%): 2.4%
13. Cell (370 papers in training set; Top 9%): 2.3%
14. Nature Neuroscience (216 papers in training set; Top 4%): 2.0%
15. Genome Medicine (154 papers in training set; Top 4%): 2.0%
16. Bioinformatics Advances (184 papers in training set; Top 2%): 1.9%
17. Nature Computational Science (50 papers in training set; Top 0.5%): 1.9%
18. Cell Systems (167 papers in training set; Top 7%): 1.7%
19. Nature Genetics (240 papers in training set; Top 4%): 1.7%
20. npj Digital Medicine (97 papers in training set; Top 3%): 1.3%
21. Nucleic Acids Research (1128 papers in training set; Top 14%): 1.2%
22. Frontiers in Genetics (197 papers in training set; Top 7%): 1.2%
23. Nature Human Behaviour (85 papers in training set; Top 3%): 0.9%
24. Scientific Reports (3102 papers in training set; Top 73%): 0.8%
25. eLife (5422 papers in training set; Top 59%): 0.7%
26. GigaScience (172 papers in training set; Top 3%): 0.7%
27. iScience (1063 papers in training set; Top 33%): 0.7%
28. PLOS ONE (4510 papers in training set; Top 70%): 0.7%
29. The American Journal of Human Genetics (206 papers in training set; Top 4%): 0.6%
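
The 50% cutoff marked in the list above can be checked with a running sum. The sketch below copies the first ten displayed percentages by hand; because those values are rounded for display, the cumulative figure is approximate. The function name `journals_to_reach` is illustrative only.

```python
# Predicted probabilities (percent) for ranks 1-10, copied from the ranked list.
probabilities = [12.3, 10.0, 8.3, 8.3, 6.3, 4.2, 3.9, 3.9, 3.8, 2.7]


def journals_to_reach(probs, threshold):
    """Return (k, mass): the smallest k whose first k values sum to at
    least `threshold`, together with that cumulative sum. Returns
    (None, total) if the threshold is never reached."""
    running = 0.0
    for k, p in enumerate(probs, start=1):
        running += p
        if running >= threshold:
            return k, running
    return None, running


k, mass = journals_to_reach(probabilities, 50.0)
print(k, round(mass, 1))  # prints "7 53.3"
```

The first six journals sum to about 49.4%, so the seventh is the one that crosses the 50% line, consistent with the stated cutoff.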