
Improving viral protein clustering using both diversified protein profiles and structural information

Nugier, Q.; Bouras, G.; Galiez, C.; Petit, M.-A.; Enault, F.

2026-05-30 bioinformatics
10.64898/2026.05.26.727815 bioRxiv

Viruses are abundant, ancestral and potentially fast-evolving biological entities. As a result, their encoded proteins are diverse, and identifying homologous relationships between sequences is as important for phylogeny and functional annotation as it is challenging. Traditional methods group viral proteins by sequence similarity, build HMM profiles for each protein family, and cluster further via profile comparisons. Here, we present an improved framework in which HMM sensitivity is boosted by enriching reference virus HMM profiles with tens of millions of metagenomic sequences. This increases diversity within most protein families, raising the diversity index from less than 2 for 92.7% of clusters to a median value of 6. Enriching the profiles more than triples the number of homologies detected compared to the raw profiles. First-step clusters are then grouped more effectively using these relationships and further unified via structural predictions and comparisons. The sequence-enrichment strategy excels at linking small proteins, while structures better connect highly structured ones such as tail and head proteins. Applied to 1.42 million proteins, our method yields 56,560 families, far fewer than the 200,018 obtained with sequence-based clustering or the 135,048 obtained with raw HMM profiles, revealing that prior approaches vastly overestimated viral protein diversity. The strategy of enriching the diversity of sequences of interest with external sequences, combined with the complementary use of structural information, highlights deep evolutionary links, offering a more accurate picture of viral protein evolution.
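
The grouping step described in the abstract, where first-step clusters are joined through profile-profile and structure-based relationships, can be illustrated with a minimal sketch. The abstract does not state which grouping algorithm the authors use; the sketch below assumes plain single-linkage merging with a union-find structure, and the cluster names, link lists, and the `merge_clusters` helper are all hypothetical.

```python
from collections import defaultdict


def merge_clusters(cluster_ids, links):
    """Group clusters into families by single linkage over pairwise links.

    Assumption: any link (profile hit or structural match) is enough to
    merge two clusters. A real pipeline would likely filter links by
    score or coverage first; the abstract does not give those details.
    """
    parent = {c: c for c in cluster_ids}

    def find(c):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = defaultdict(set)
    for c in cluster_ids:
        families[find(c)].add(c)
    # Sort members and families so the result is deterministic.
    return sorted((sorted(m) for m in families.values()), key=lambda m: m[0])


# Toy data (hypothetical): five first-step clusters and two kinds of links.
clusters = ["c1", "c2", "c3", "c4", "c5"]
profile_links = [("c1", "c2")]    # e.g. from enriched-profile comparisons
structure_links = [("c2", "c3")]  # e.g. from structure comparisons

families = merge_clusters(clusters, profile_links + structure_links)
print(families)
```

In this toy case the profile link and the structural link chain `c1`, `c2`, and `c3` into one family, while `c4` and `c5` stay as singletons. This illustrates how two complementary sources of links can reduce the family count more than either one alone.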

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 2%; 12.5%
2. Nature Communications; 4913 papers in training set; Top 18%; 10.1%
3. PLOS Computational Biology; 1633 papers in training set; Top 4%; 8.4%
4. BMC Bioinformatics; 383 papers in training set; Top 1%; 6.8%
5. Nucleic Acids Research; 1128 papers in training set; Top 3%; 6.4%
6. Virus Evolution; 140 papers in training set; Top 0.3%; 4.9%
7. Cell Systems; 167 papers in training set; Top 3%; 4.9%
50% of probability mass above
8. NAR Genomics and Bioinformatics; 214 papers in training set; Top 1%; 2.7%
9. Nature Biotechnology; 147 papers in training set; Top 3%; 2.7%
10. Communications Biology; 886 papers in training set; Top 4%; 2.5%
11. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 28%; 2.1%
12. Cell Reports Methods; 141 papers in training set; Top 2%; 2.1%
13. Scientific Reports; 3102 papers in training set; Top 53%; 1.9%
14. Nature Methods; 336 papers in training set; Top 4%; 1.9%
15. Genome Biology; 555 papers in training set; Top 4%; 1.8%
16. Bioinformatics Advances; 184 papers in training set; Top 3%; 1.7%
17. Journal of Chemical Information and Modeling; 207 papers in training set; Top 2%; 1.5%
18. Briefings in Bioinformatics; 326 papers in training set; Top 5%; 1.3%
19. PLOS ONE; 4510 papers in training set; Top 58%; 1.3%
20. Journal of Molecular Biology; 217 papers in training set; Top 2%; 1.3%
21. Molecular Biology and Evolution; 488 papers in training set; Top 3%; 1.2%
22. iScience; 1063 papers in training set; Top 21%; 1.2%
23. Advanced Science; 249 papers in training set; Top 17%; 0.9%
24. Viruses; 318 papers in training set; Top 4%; 0.9%
25. Microbiome; 139 papers in training set; Top 3%; 0.8%
26. mSphere; 281 papers in training set; Top 5%; 0.8%
27. GigaScience; 172 papers in training set; Top 3%; 0.7%
28. Nature Computational Science; 50 papers in training set; Top 2%; 0.7%
29. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 11%; 0.6%
30. Computers in Biology and Medicine; 120 papers in training set; Top 5%; 0.6%
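
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. This sketch assumes that the last percentage of each entry is that journal's predicted probability, which the page implies but does not label explicitly. The `journals_to_reach` helper is hypothetical.

```python
from itertools import accumulate

# Last percentage of each of the 30 listed entries, in rank order
# (assumed to be the predicted probability for that journal).
probs = [
    12.5, 10.1, 8.4, 6.8, 6.4, 4.9, 4.9, 2.7, 2.7, 2.5,
    2.1, 2.1, 1.9, 1.9, 1.8, 1.7, 1.5, 1.3, 1.3, 1.3,
    1.2, 1.2, 0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6,
]


def journals_to_reach(probs, threshold):
    """Return the smallest rank whose cumulative probability meets threshold.

    Returns None if the listed entries never reach the threshold.
    """
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


print(journals_to_reach(probs, 50))
```

With the listed values the cumulative sum first reaches 50% at rank 7, which matches the position of the "50% of probability mass above" marker in the list.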