
JRSeek: Artificial Intelligence Meets Jelly Roll Fold Classification in Viruses

Sanchez, J. E.; Guo, W.; Li, C.; Li, L. E.; Xiao, C.

2025-01-29 bioinformatics
10.1101/2025.01.27.635132 bioRxiv

The jelly roll (JR) fold is the most common structural motif found in the capsid and nucleocapsid of viruses. Its pervasiveness across many different viral families motivates the development of a tool to predict its presence from a sequence. In the current work, logistic regression (LR) models trained on six different large language model (LLM) embeddings exhibited over 95% accuracy in differentiating JR from non-JR sequences. The dataset used for training and testing included sequences from single JR viruses, non-JR viruses, and non-virus immunoglobulin-like β-sandwich (IGLBS) proteins, which closely resemble the JR fold in structure. The high accuracy is particularly remarkable given the low sequence similarity across viral families and the balanced nature of the dataset. Also, the accuracy of the models was independent of the LLM embedding used, suggesting that peak accuracy for predicting viral JR folds hinges more on data quality and quantity than on the specific mathematical models used. Given that many viral capsid and nucleocapsid structures have yet to be resolved, using sequence-based LLMs is a promising strategy that can readily be applied to available data. Principal Component Analysis of the Bert-U100 embeddings demonstrates that most IGLBS sequences and a subset of JR and non-JR sequences are distinguishable even before the application of the LR model, but the LR model is necessary to differentiate a subset of more ambiguous sequences. When applied to double JR folds, the Bert-U100 model was able to assign the JR motif for some viral families, providing evidence for the model's generalizability. However, for other families, this generalizability was not observed, motivating a future need to develop other models informed by double JR folds. Lastly, the Bert-U100 model was also able to predict whether sequences from a dataset of unclassified viruses produce the JR fold. Two examples are given, and the JR predictions are corroborated by AlphaFold3.
Altogether, this work demonstrates that JR folds can, in principle, be predicted from their sequences.
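
As a rough illustration of the approach the abstract describes (a logistic regression classifier on fixed-length sequence embeddings), the following is a minimal sketch and not the authors' pipeline. It assumes synthetic Gaussian vectors in place of real LLM embeddings, an embedding dimension of 8, and plain batch gradient descent; all function names are hypothetical.

```python
import math
import random


def make_synthetic_embeddings(n_per_class, dim, seed=0):
    """Return (X, y). ASSUMPTION: Gaussian clusters stand in for real
    per-sequence LLM embeddings (label 1 = JR, label 0 = non-JR)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(w, b, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


# Balanced synthetic dataset with an 80/20 train/test split.
X, y = make_synthetic_embeddings(n_per_class=100, dim=8, seed=0)
indices = list(range(len(X)))
random.Random(1).shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]

w, b = train_logistic_regression(
    [X[i] for i in train_idx], [y[i] for i in train_idx]
)
test_accuracy = accuracy(w, b, [X[i] for i in test_idx], [y[i] for i in test_idx])
print(f"held-out accuracy: {test_accuracy:.3f}")
```

With real data, the synthetic generator would be replaced by embeddings computed from each protein sequence, and a library implementation of logistic regression would likely be preferable to this hand-written loop.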

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology | 1633 papers in training set | Top 3% | 10.2%
2. Journal of Chemical Information and Modeling | 207 papers in training set | Top 0.8% | 6.9%
3. PLOS ONE | 4510 papers in training set | Top 27% | 6.4%
4. Frontiers in Genetics | 197 papers in training set | Top 1% | 4.9%
5. Scientific Reports | 3102 papers in training set | Top 23% | 4.9%
6. Computers in Biology and Medicine | 120 papers in training set | Top 0.4% | 4.9%
7. Frontiers in Bioinformatics | 45 papers in training set | Top 0.1% | 4.9%
8. Viruses | 318 papers in training set | Top 1% | 4.3%
9. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 1% | 4.2%
50% of probability mass above
10. Bioinformatics Advances | 184 papers in training set | Top 1% | 3.6%
11. BMC Bioinformatics | 383 papers in training set | Top 3% | 3.6%
12. ImmunoInformatics | 11 papers in training set | Top 0.1% | 3.1%
13. Briefings in Bioinformatics | 326 papers in training set | Top 3% | 2.6%
14. Bioinformatics | 1061 papers in training set | Top 6% | 2.1%
15. Biology Methods and Protocols | 53 papers in training set | Top 0.9% | 1.7%
16. Frontiers in Immunology | 586 papers in training set | Top 4% | 1.7%
17. Computational Biology and Chemistry | 23 papers in training set | Top 0.2% | 1.2%
18. Journal of General Virology | 46 papers in training set | Top 0.6% | 1.1%
19. Frontiers in Molecular Biosciences | 100 papers in training set | Top 3% | 1.0%
20. PeerJ | 261 papers in training set | Top 11% | 1.0%
21. Biology | 43 papers in training set | Top 2% | 0.9%
22. GigaScience | 172 papers in training set | Top 2% | 0.9%
23. BMC Genomics | 328 papers in training set | Top 6% | 0.8%
24. BioSystems | 11 papers in training set | Top 0.3% | 0.8%
25. Journal of Biosciences | 12 papers in training set | Top 0.2% | 0.7%
26. Patterns | 70 papers in training set | Top 3% | 0.7%
27. Journal of Proteome Research | 215 papers in training set | Top 2% | 0.6%
28. Journal of Biomedical Informatics | 45 papers in training set | Top 2% | 0.6%
29. Physical Biology | 43 papers in training set | Top 2% | 0.6%
30. iScience | 1063 papers in training set | Top 40% | 0.5%
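
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the listed percentages. This is a minimal sketch: the values are copied from the 30 ranked entries in this section (already rounded to one decimal place), and the function name `journals_needed` is hypothetical.

```python
# Predicted probabilities (in percent) for ranks 1 through 30, as listed.
probabilities = [
    10.2, 6.9, 6.4, 4.9, 4.9, 4.9, 4.9, 4.3, 4.2, 3.6,
    3.6, 3.1, 2.6, 2.1, 1.7, 1.7, 1.2, 1.1, 1.0, 1.0,
    0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.6, 0.5,
]


def journals_needed(probs, threshold):
    """Return (rank, cumulative mass) at the first rank where the running
    total reaches the threshold, or (None, total) if it never does."""
    cumulative = 0.0
    for rank, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= threshold:
            return rank, cumulative
    return None, cumulative


rank, mass = journals_needed(probabilities, 50.0)
print(f"rank {rank} reaches {mass:.1f}% cumulative probability")
# prints "rank 9 reaches 51.6% cumulative probability"
```

The first eight entries sum to 47.4%, so the ninth is the first to push the running total past the 50% mark.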