JRSeek: Artificial Intelligence Meets Jelly Roll Fold Classification in Viruses

Sanchez, J. E.; Guo, W.; Li, C.; Li, L. E.; Xiao, C.

2025-01-29 bioinformatics

10.1101/2025.01.27.635132 bioRxiv


The jelly roll (JR) fold is the most common structural motif found in the capsid and nucleocapsid of viruses. Its pervasiveness across many different viral families motivates developing a tool to predict its presence from a sequence. In the current work, logistic regression (LR) models trained on six different large language model (LLM) embeddings exhibited over 95% accuracy in differentiating JR from non-JR sequences. The dataset used for training and testing included sequences from single JR viruses, non-JR viruses, and non-virus immunoglobulin-like β-sandwich (IGLBS) proteins, which closely resemble the JR fold in structure. The high accuracy is particularly remarkable given the low sequence similarity across viral families and the balanced nature of the dataset. Also, the accuracy of the models was independent of the choice of LLM embedding, suggesting that peak accuracy for predicting viral JR folds hinges more on data quality and quantity than on the specific mathematical models used. Given that many viral capsid and nucleocapsid structures have yet to be resolved, using sequence-based LLMs is a promising strategy that can readily be applied to available data. Principal component analysis of the Bert-U100 embeddings demonstrates that most IGLBS sequences and a subset of JR and non-JR sequences are distinguishable even before the application of the LR model, but the LR model is necessary to differentiate a subset of more ambiguous sequences. When applied to double JR folds, the Bert-U100 model was able to assign the JR motif for some viral families, providing evidence for the model's generalizability. However, for other families, this generalizability was not observed, motivating the future development of other models informed by double JR folds. Lastly, the Bert-U100 model was also able to predict whether sequences from a dataset of unclassified viruses adopt the JR fold. Two examples are given, and the JR predictions are corroborated by AlphaFold3. Altogether, this work demonstrates that JR folds can, in principle, be predicted from their sequences.
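
As a rough illustration of the general approach described in the abstract (a logistic regression classifier fitted on fixed-length embedding vectors), the following minimal sketch may help. It is not the authors' implementation: the "embeddings" are synthetic Gaussian vectors standing in for real LLM embeddings, the embedding dimension, learning rate, and train/test split are arbitrary assumptions, and all function names are hypothetical.

```python
import math
import random


def make_synthetic_embeddings(n_per_class, dim, seed=0):
    """Return synthetic vectors standing in for LLM embeddings (assumption:
    real embeddings would come from a protein language model instead).
    Label 1 plays the role of 'JR', label 0 the role of 'non-JR'."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(w, b, X, y):
    """Fraction of vectors whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


# Balanced synthetic dataset, shuffled and split 80/20 (arbitrary choice).
X, y = make_synthetic_embeddings(n_per_class=100, dim=8, seed=0)
order = list(range(len(X)))
random.Random(1).shuffle(order)
cut = int(0.8 * len(order))
train_idx, test_idx = order[:cut], order[cut:]

w, b = train_logistic_regression([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
test_accuracy = accuracy(w, b, [X[i] for i in test_idx],
                         [y[i] for i in test_idx])
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

In practice one would replace `make_synthetic_embeddings` with per-sequence embeddings from a protein language model and would likely use an established library classifier; the accuracy printed here reflects only the synthetic data and says nothing about the results reported in the abstract.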

