
VIRALpre: Genomic Foundation Model Embedding Fused with K-mer Feature for Virus Identification

Wang, Z.; Yu, Q.; Li, Y.

2024-11-15 bioinformatics
10.1101/2024.11.12.623150 bioRxiv

Viruses, submicroscopic infectious agents, affect all forms of life. Identifying viral sequences is essential for understanding their biological functions and for analyzing their impact on public health and on the development of microbial communities. Given this importance, tools based on various mathematical methods and algorithms have been developed. However, previous methods struggle to identify viral sequences accurately, especially short contigs, because such contigs carry limited information and the methods rely on small-scale, closed-set datasets. Here we propose VIRALpre, a hybrid framework that combines genomic foundation model (GFM) embeddings with K-mer features of sequences to precisely recognize viral genomic fragments. VIRALpre draws on the generalization ability of GFMs, which have proven their strength in various downstream tasks thanks to newly established large-scale training databases and the attention mechanism. K-mer features, in turn, provide additional biological information that compensates for the limitations of GFMs in classification tasks. Comprehensive experimental results demonstrate that VIRALpre significantly outperforms all previous methods on virus identification, by 4% in accuracy. To show that the model remains effective on contigs dissimilar to the training data, a BLASTn-based similarity cut-off test (e-value set to 10^-5) is performed, in which VIRALpre achieves about a 10% F1-score improvement. Beyond these well-built test datasets, new zero-shot cross-dataset tests are conducted on benchmark datasets sampled from natural environments; there, VIRALpre identifies most viral sequences while keeping a very low false positive rate. Based on these experiments, VIRALpre can handle short-contig virus identification by genuinely learning the distinguishing features of viral sequences, and we hope it can serve as a guide that advances virus-related research.
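To make the fusion idea in the abstract concrete, here is a minimal Python sketch of computing a K-mer frequency vector and joining it with a sequence embedding. It is an illustration only, not the authors' implementation: the function names `kmer_frequencies` and `fuse`, the choice of k, the normalization, the use of plain concatenation as the fusion step, and the placeholder embedding values are all assumptions, since the abstract does not specify them.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized counts of every DNA k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        # Windows containing N or any other non-ACGT symbol are skipped.
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def fuse(embedding, kmer_vector):
    """Concatenate the two feature vectors (assumed fusion; one simple option)."""
    return list(embedding) + list(kmer_vector)


# Placeholder numbers standing in for a GFM embedding (hypothetical values).
embedding = [0.1, -0.3, 0.7]
features = fuse(embedding, kmer_frequencies("ACGTACGTNNACGT", k=2))
```

The fused vector would then be passed to a downstream classifier; which classifier and which fusion strategy VIRALpre actually uses is not stated in the abstract.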

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics | 326 papers in training set | Top 0.1% | 33.6%
2. Advanced Science | 249 papers in training set | Top 3% | 4.9%
3. PLOS Computational Biology | 1633 papers in training set | Top 8% | 4.0%
4. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 1% | 4.0%
5. Bioinformatics | 1061 papers in training set | Top 5% | 3.9%
50% of probability mass above
6. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 2% | 3.7%
7. Frontiers in Microbiology | 375 papers in training set | Top 2% | 3.7%
8. BMC Bioinformatics | 383 papers in training set | Top 3% | 3.1%
9. Computers in Biology and Medicine | 120 papers in training set | Top 1% | 2.4%
10. Quantitative Biology | 11 papers in training set | Top 0.2% | 1.7%
11. Science China Life Sciences | 26 papers in training set | Top 0.9% | 1.7%
12. PLOS ONE | 4510 papers in training set | Top 56% | 1.5%
13. mSystems | 361 papers in training set | Top 5% | 1.5%
14. Patterns | 70 papers in training set | Top 1% | 1.5%
15. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 1% | 1.4%
16. Frontiers in Genetics | 197 papers in training set | Top 6% | 1.4%
17. Journal of Chemical Information and Modeling | 207 papers in training set | Top 2% | 1.4%
18. Nucleic Acids Research | 1128 papers in training set | Top 14% | 1.1%
19. Frontiers in Immunology | 586 papers in training set | Top 6% | 0.9%
20. IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.4% | 0.9%
21. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.6% | 0.8%
22. Scientific Reports | 3102 papers in training set | Top 74% | 0.8%
23. mSphere | 281 papers in training set | Top 6% | 0.8%
24. Frontiers in Molecular Biosciences | 100 papers in training set | Top 5% | 0.8%
25. Nature Machine Intelligence | 61 papers in training set | Top 4% | 0.7%
26. Communications Biology | 886 papers in training set | Top 28% | 0.7%
27. National Science Review | 22 papers in training set | Top 3% | 0.5%
28. Journal of Genetics and Genomics | 36 papers in training set | Top 3% | 0.5%
29. GigaScience | 172 papers in training set | Top 4% | 0.5%
30. Journal of Bioinformatics and Systems Biology | 14 papers in training set | Top 1% | 0.5%