Improving viral protein clustering using both diversified protein profiles and structural information

Nugier, Q.; Bouras, G.; Galiez, C.; Petit, M.-A.; Enault, F.

2026-05-30 bioinformatics

10.64898/2026.05.26.727815 bioRxiv


Viruses are abundant, ancestral and potentially fast-evolving biological entities. As a result, the proteins they encode are diverse, and identifying homologous relationships between sequences is as challenging as it is important for phylogeny and functional annotation. Traditional methods group viral proteins by sequence similarity, build an HMM profile for each protein family, and then cluster further via profile comparisons. Here, we present an improved framework in which HMM sensitivity is boosted by enriching reference virus HMM profiles with tens of millions of metagenomic sequences. This increases diversity within most protein families: the diversity index, initially below 2 for 92.7% of clusters, rises to a median value of 6. Enriching the profiles more than triples the number of homologies detected compared to the raw profiles. First-step clusters are then grouped more effectively using these relationships and further unified via structural predictions and comparisons. The sequence-enrichment strategy excels at linking small proteins, while structures better connect highly structured ones such as tail and head proteins. Applied to 1.42 million proteins, our method yields 56,560 families, far fewer than the 200,018 obtained with sequence-based clustering or the 135,048 obtained with raw HMM profiles, revealing that prior approaches vastly overestimated viral protein diversity. The strategy of enriching the diversity of sequences of interest with external sequences, combined with the complementary use of structural information, highlights deep evolutionary links, offering a more accurate picture of viral protein evolution.
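The grouping step described in the abstract, where first-step clusters are merged through profile-based and structure-based relationships, can be illustrated with a minimal sketch. It assumes, purely for illustration, that two clusters join the same family whenever any link connects them (connected components via union-find); the abstract does not state the actual merging rule, and the function name `merge_clusters`, the cluster identifiers, and the example links are all hypothetical.

```python
from collections import defaultdict


def merge_clusters(cluster_ids, links):
    """Group clusters into families as connected components of the link graph.

    Assumption: a single link (profile-profile or structural) is enough to
    merge two clusters. The real pipeline may apply stricter criteria.
    """
    parent = {c: c for c in cluster_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = defaultdict(set)
    for c in cluster_ids:
        families[find(c)].add(c)
    return sorted(sorted(members) for members in families.values())


# Hypothetical first-step clusters and links, not data from the study.
clusters = ["c1", "c2", "c3", "c4", "c5"]
profile_links = [("c1", "c2")]    # e.g. detected by enriched HMM profiles
structure_links = [("c2", "c3")]  # e.g. detected by structural comparison

families = merge_clusters(clusters, profile_links + structure_links)
print(families)
```

In this toy input the profile link alone leaves four groups, while adding the structural link joins `c1`, `c2` and `c3` into one family, which mirrors how the two sources of evidence are described as complementary.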

