
FoundedPBI: Using Genomic Foundation Models to predict Phage-Bacterium Interactions

Carrillo Barrera, P.; Babey, A.; Pena, C. A.

2026-03-26 bioinformatics
10.64898/2026.03.24.713871 bioRxiv

The scalability of phage therapy as a viable alternative or complement to antibiotics is limited by the labor-intensive experimental screening required to identify compatible phage-bacterium pairs. To accelerate this discovery process, we propose FoundedPBI, an ensemble deep learning approach that leverages the emergent capabilities of genomic foundation models (large language models pre-trained on vast DNA corpora) to predict phage-bacterium interactions from DNA sequences alone. We employ an ensemble strategy that aggregates outputs from three state-of-the-art DNA language models into a unified meta-embedding, which is then processed by a neural classifier. Our approach makes two key contributions: (1) We demonstrate that performing ensemble learning across models trained on different genomic data, namely prokaryotic (Nucleotide Transformer v2, DNABERT-2) and bacteriophage (MegaDNA) genomes, captures partially orthogonal biological signals, yielding a 6% F1-score improvement over the best individual model. (2) We adapt long-context NLP aggregation strategies to handle whole bacterial and phage genomes (up to 5M base pairs) that exceed the foundation models' context windows (12-96K bp) by a factor of 50-100, a critical challenge largely unaddressed in prior genomic deep learning work. On the PredPHI benchmark, FoundedPBI achieves a 76% F1-score, outperforming the current state-of-the-art (PBIP) by 7%. On our internal dataset (CI4CB), we achieve a 93% F1-score, improving on our previous best methods by 4%. These results demonstrate that ensemble learning with proper long-context handling enables effective knowledge transfer of genomic foundation models to specialized prediction tasks.
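The abstract does not say which aggregation strategy or fusion rule is used, so the following is only a minimal sketch of the general idea, under stated assumptions: each genome is split into non-overlapping windows that fit a model's context, the per-window embeddings are mean-pooled, and the pooled vectors from all models are concatenated into one meta-embedding. The embedders here are hypothetical k-mer-count stand-ins, not the real Nucleotide Transformer v2, DNABERT-2, or MegaDNA models, and all function names, window sizes, and dimensions are illustrative.

```python
import numpy as np

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def make_toy_embedder(dim, k):
    """Hypothetical stand-in for a DNA language model: normalized k-mer counts."""
    def embed(chunk):
        vec = np.zeros(dim)
        for i in range(len(chunk) - k + 1):
            idx = 0
            for base in chunk[i:i + k]:
                idx = idx * 4 + BASE.get(base, 0)
            vec[idx % dim] += 1.0
        total = vec.sum()
        # Chunks shorter than k yield an all-zero vector.
        return vec / total if total > 0 else vec
    return embed

def embed_genome(seq, embed, window):
    """Split a genome into windows and mean-pool the per-window embeddings.

    Mean pooling is an assumption; the abstract only mentions
    'long-context NLP aggregation strategies'.
    """
    if not seq:
        raise ValueError("empty sequence")
    chunks = [seq[i:i + window] for i in range(0, len(seq), window)]
    return np.mean([embed(c) for c in chunks], axis=0)

def meta_embedding(phage, bacterium, models):
    """Concatenate pooled phage and bacterium embeddings from every model."""
    parts = []
    for embed, window in models:
        parts.append(embed_genome(phage, embed, window))
        parts.append(embed_genome(bacterium, embed, window))
    return np.concatenate(parts)

# Three toy "models" with different (tiny) context windows.
models = [
    (make_toy_embedder(8, 2), 12),
    (make_toy_embedder(16, 3), 24),
    (make_toy_embedder(4, 1), 96),
]

rng = np.random.default_rng(0)
phage = "".join(rng.choice(list("ACGT"), 200))
bacterium = "".join(rng.choice(list("ACGT"), 1000))

vec = meta_embedding(phage, bacterium, models)
print(vec.shape)
```

In a pipeline of this shape, the resulting vector would be the input to the downstream neural classifier; swapping mean pooling for another aggregation only changes `embed_genome`.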

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | Cell Systems | 167 papers in training set | Top 0.8% | 12.1%
2 | Nature Biotechnology | 147 papers in training set | Top 0.5% | 12.1%
3 | Nature Machine Intelligence | 61 papers in training set | Top 0.3% | 8.3%
4 | Nucleic Acids Research | 1128 papers in training set | Top 3% | 6.7%
5 | Nature Methods | 336 papers in training set | Top 2% | 6.2%
6 | Nature Communications | 4913 papers in training set | Top 30% | 6.2%
50% of probability mass above
7 | Bioinformatics Advances | 184 papers in training set | Top 0.8% | 4.8%
8 | Bioinformatics | 1061 papers in training set | Top 5% | 3.9%
9 | Advanced Science | 249 papers in training set | Top 6% | 3.5%
10 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 24% | 2.7%
11 | Cell Genomics | 162 papers in training set | Top 2% | 2.6%
12 | Genome Research | 409 papers in training set | Top 2% | 2.0%
13 | Nature | 575 papers in training set | Top 10% | 1.9%
14 | Genome Biology | 555 papers in training set | Top 4% | 1.7%
15 | Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.5%
16 | Scientific Reports | 3102 papers in training set | Top 65% | 1.3%
17 | Nature Biomedical Engineering | 42 papers in training set | Top 1% | 1.2%
18 | npj Digital Medicine | 97 papers in training set | Top 3% | 1.2%
19 | Genome Medicine | 154 papers in training set | Top 6% | 1.2%
20 | Patterns | 70 papers in training set | Top 2% | 1.1%
21 | Frontiers in Genetics | 197 papers in training set | Top 8% | 0.9%
22 | Nature Computational Science | 50 papers in training set | Top 1% | 0.9%
23 | Nature Medicine | 117 papers in training set | Top 4% | 0.9%
24 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
25 | Science | 429 papers in training set | Top 18% | 0.9%
26 | Nature Genetics | 240 papers in training set | Top 7% | 0.7%
27 | iScience | 1063 papers in training set | Top 33% | 0.7%
28 | IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.7% | 0.7%
29 | PLOS ONE | 4510 papers in training set | Top 70% | 0.7%
30 | PLOS Computational Biology | 1633 papers in training set | Top 28% | 0.6%
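
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked from the probabilities listed above. The sketch below assumes "account for 50%" means the smallest number of top-ranked journals whose cumulative probability reaches at least 50%; the function name is illustrative, and the listed values are rounded, so the sum is approximate.

```python
# Predicted probabilities (in percent) for the ten top-ranked journals, as listed.
probs = [12.1, 12.1, 8.3, 6.7, 6.2, 6.2, 4.8, 3.9, 3.5, 2.7]

def journals_to_reach(probs, threshold):
    """Return (k, mass): the smallest k whose cumulative mass reaches threshold."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return k, total
    return None

k, mass = journals_to_reach(probs, 50.0)
print(k, round(mass, 1))
```

With the listed values, the first five journals sum to about 45.4% and the sixth brings the total to about 51.6%, consistent with the stated cutoff.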