
Protein-level prediction of Klebsiella phage adsorption identifies conserved receptor-binding motifs.

Fumagalli, F.; Spigler, G.

2026-05-23 bioinformatics
10.64898/2026.05.21.726843 bioRxiv

Bacteriophage therapy offers a potential route to treat antibiotic-resistant Klebsiella pneumoniae infections, but its use is limited by the narrow specificity of phage-host interactions. In Klebsiella, adsorption is largely determined by receptor-binding proteins (RBPs) that recognize bacterial capsular polysaccharides, yet current machine learning approaches often represent whole phages rather than the individual proteins that mediate recognition. Here, we ask whether adsorption can be predicted at the level of single RBPs and whether the resulting models can identify the molecular features responsible for host specificity. Using experimentally validated Klebsiella phage-host interactions, we extended the PhageHostLearn framework from averaged phage-level representations to individual RBP-level predictions. We found that single-RBP models recover the predictive performance of strain-level models when host capsule identity is explicitly represented. However, models trained only on interaction-level labels did not reliably distinguish motif-bearing RBPs from other viral proteins, indicating that protein-level inputs alone are insufficient for mechanistic interpretability. To resolve this ambiguity, we identified serotype-specific conserved motifs among RBPs from phages infecting the same capsular type. Structural modelling showed that these motifs localize to exposed regions of RBPs and resemble carbohydrate-binding modules. Incorporating motif information into a relabelled training scheme improved prioritization of motif-bearing RBPs while preserving interaction-level predictive power. We further identified a candidate multi-motif RBP from phage S8c that may recognize multiple capsular serotypes. Together, these results support a modular model of Klebsiella phage adsorption in which conserved sub-protein elements drive capsule recognition. 
More broadly, this work shows how protein-level machine learning combined with biological constraints can move beyond accurate phage-host prediction toward mechanistic identification of host-range determinants.

Author summary

Bacteriophages, viruses that infect bacteria, are being explored as alternatives to antibiotics, especially against drug-resistant pathogens such as Klebsiella pneumoniae. The challenge is specificity: each phage attaches to only a narrow range of bacterial strains, recognising them through proteins on its tail that bind the bacterium's protective sugar capsule. Choosing or engineering the right phage for a given infection therefore requires understanding what these recognition proteins actually do. We asked whether a machine learning model could move beyond predicting which phages infect a given strain and start identifying which protein on the phage drives that recognition. Prediction alone, we found, is not enough: a model can be accurate without pointing to the responsible protein. To bridge this gap, we searched for short shared sequences among recognition proteins from phages that infect bacteria with the same capsule type, and used these shared patterns to guide the model. This combination correctly prioritised the recognition protein far more often than chance. One phage protein, from phage S8c, carried patterns matching five different capsule types, suggesting a candidate broadly recognising protein for future experimental study.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | PLOS Computational Biology | 1633 papers in training set | Top 0.1% | 37.8%
2 | mSystems | 361 papers in training set | Top 1.0% | 8.4%
3 | Cell Systems | 167 papers in training set | Top 4% | 3.7%
4 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.7% | 3.6%
50% of probability mass above
5 | mBio | 750 papers in training set | Top 4% | 3.6%
6 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 30% | 1.9%
7 | eLife | 5422 papers in training set | Top 45% | 1.5%
8 | Nature Communications | 4913 papers in training set | Top 54% | 1.5%
9 | Molecular Biology and Evolution | 488 papers in training set | Top 3% | 1.5%
10 | Cell Host & Microbe | 113 papers in training set | Top 3% | 1.5%
11 | PLOS Biology | 408 papers in training set | Top 12% | 1.3%
12 | iScience | 1063 papers in training set | Top 21% | 1.2%
13 | Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 0.6% | 1.2%
14 | Science Advances | 1098 papers in training set | Top 25% | 1.0%
15 | Frontiers in Genetics | 197 papers in training set | Top 7% | 1.0%
16 | Molecular Systems Biology | 142 papers in training set | Top 1% | 1.0%
17 | Scientific Reports | 3102 papers in training set | Top 69% | 1.0%
18 | Microbial Genomics | 204 papers in training set | Top 2% | 0.9%
19 | Nucleic Acids Research | 1128 papers in training set | Top 16% | 0.9%
20 | Cell Reports | 1338 papers in training set | Top 31% | 0.9%
21 | Bioinformatics | 1061 papers in training set | Top 9% | 0.8%
22 | PeerJ | 261 papers in training set | Top 15% | 0.7%
23 | Frontiers in Microbiology | 375 papers in training set | Top 9% | 0.7%
24 | BMC Genomics | 328 papers in training set | Top 6% | 0.7%
25 | BMC Bioinformatics | 383 papers in training set | Top 7% | 0.7%
26 | Ecology Letters | 121 papers in training set | Top 1% | 0.7%
27 | Cell | 370 papers in training set | Top 18% | 0.7%
28 | Proceedings of the Royal Society B: Biological Sciences | 341 papers in training set | Top 7% | 0.7%
29 | Microbiology | 57 papers in training set | Top 1% | 0.7%
30 | Antibiotics | 32 papers in training set | Top 1% | 0.7%