Bacteriophage host prediction using a genome language model

Wang, Z.; Arsuaga, J.

2026-03-20 bioinformatics
10.64898/2026.03.19.712863 bioRxiv
Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants (from receptor-binding proteins to anti-defense systems and downstream infection compatibility) and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 capture a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics: top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant.
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction.

Author summary

Bacteriophages are viruses that prey on bacteria and play central roles in microbial ecosystems, nutrient cycling, and the spread of antibiotic resistance genes. Knowing which bacterium a phage can infect is important for applications such as phage therapy, where viruses are used to treat bacterial infections, but making this prediction from DNA sequence data alone remains difficult. Existing computational tools each exploit different types of genomic evidence, and none works reliably across all settings. We asked whether an artificial intelligence model trained to read raw DNA, without ever being shown which phages infect which hosts, could contribute a new, complementary signal. We found that this approach was particularly effective at narrowing the field to a short list of candidate hosts and at capturing broad evolutionary relationships between phages and bacteria. When we combined it with established sequence-comparison tools, overall prediction improved beyond what any single method achieved alone. By examining when each method succeeded or failed, we identified biological factors that govern prediction difficulty, offering practical guidance for building more robust prediction systems.
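
The retrieval, fusion, and scoring steps named in the abstract (cosine-similarity ranking, reciprocal rank fusion, top-k accuracy) can be illustrated with a minimal sketch. This is not the authors' pipeline. The embeddings are tiny made-up vectors standing in for Evo2 outputs, the normalization is assumed to be L2 (the abstract says only "normalization"), the fusion constant k=60 is a common default and not a value taken from the paper, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (assumed normalization; the abstract does not specify one)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def rank_hosts_by_cosine(phage_emb, host_emb):
    """For each phage (row), return host indices sorted from most to least similar."""
    sims = l2_normalize(phage_emb) @ l2_normalize(host_emb).T
    return np.argsort(-sims, axis=1, kind="stable")

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings of the same candidates for one phage.

    Each ranking is a 1-D array of host indices, best first.
    A host's fused score is the sum over methods of 1 / (k + rank).
    """
    n = len(rankings[0])
    scores = np.zeros(n)
    for ranking in rankings:
        ranks = np.empty(n, dtype=int)
        ranks[np.asarray(ranking)] = np.arange(1, n + 1)  # rank of each host, 1-based
        scores += 1.0 / (k + ranks)
    return np.argsort(-scores, kind="stable")

def top_k_accuracy(rankings, true_hosts, k):
    """Fraction of phages whose recorded host appears among the first k ranked hosts."""
    hits = [true in ranking[:k] for ranking, true in zip(rankings, true_hosts)]
    return float(np.mean(hits))

# Toy stand-ins for whole-genome embeddings: 2 phages, 3 candidate hosts.
phage_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
host_emb = np.array([[2.0, 0.1], [0.1, 1.0], [1.0, 1.0]])

embedding_rankings = rank_hosts_by_cosine(phage_emb, host_emb)

# Hypothetical rankings from two other methods for phage 0, fused with the embedding ranking.
fused_phage0 = reciprocal_rank_fusion(
    [embedding_rankings[0], np.array([2, 0, 1]), np.array([2, 1, 0])]
)
```

With recorded hosts `[0, 2]`, `top_k_accuracy(embedding_rankings, [0, 2], 1)` counts only the first phage as a hit, while widening to `k=2` recovers both. This mirrors, on a toy scale, the gap between top-1 and top-10 retrieval reported above.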

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology | 1633 papers in training set | Top 0.8% | 22.2%
2. NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.1% | 10.0%
3. Bioinformatics | 1061 papers in training set | Top 3% | 8.3%
4. Cell Systems | 167 papers in training set | Top 2% | 6.2%
5. mSystems | 361 papers in training set | Top 2% | 6.2%
50% of probability mass above
6. Bioinformatics Advances | 184 papers in training set | Top 1.0% | 4.1%
7. Genome Biology | 555 papers in training set | Top 2% | 3.9%
8. GigaScience | 172 papers in training set | Top 0.5% | 3.6%
9. Nature Biotechnology | 147 papers in training set | Top 3% | 2.8%
10. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 28% | 2.0%
11. Nucleic Acids Research | 1128 papers in training set | Top 11% | 1.7%
12. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.6%
13. Microbial Genomics | 204 papers in training set | Top 1% | 1.3%
14. BMC Biology | 248 papers in training set | Top 2% | 1.2%
15. Frontiers in Bioinformatics | 45 papers in training set | Top 0.5% | 1.2%
16. Scientific Reports | 3102 papers in training set | Top 68% | 1.1%
17. Frontiers in Genetics | 197 papers in training set | Top 8% | 0.9%
18. mSphere | 281 papers in training set | Top 5% | 0.9%
19. Microbiome | 139 papers in training set | Top 2% | 0.9%
20. Nature Communications | 4913 papers in training set | Top 60% | 0.9%
21. PeerJ | 261 papers in training set | Top 13% | 0.9%
22. mBio | 750 papers in training set | Top 11% | 0.8%
23. Patterns | 70 papers in training set | Top 2% | 0.8%
24. PLOS ONE | 4510 papers in training set | Top 68% | 0.7%
25. Cell Genomics | 162 papers in training set | Top 7% | 0.7%
26. BMC Genomics | 328 papers in training set | Top 6% | 0.7%
27. Cell Host & Microbe | 113 papers in training set | Top 5% | 0.7%
28. iScience | 1063 papers in training set | Top 38% | 0.6%
29. Molecular Ecology Resources | 161 papers in training set | Top 1% | 0.6%