KMer-Node2Vec: Learning Vector Representations of K-mers from the K-mer Graph

Yu, Z.; Yang, Z.; Lan, Q.; Huang, F.; Cai, Y.

2022-09-01 bioinformatics
10.1101/2022.08.30.505832 bioRxiv
Learning low-dimensional continuous vector representations of the short k-mers obtained by splitting long DNA sequences is key to DNA sequence modeling, which supports many bioinformatics tasks such as DNA sequence classification and retrieval. DNA2Vec is the most widely used method for DNA sequence embedding. However, it scales poorly to large data sets because training its k-mer embeddings takes an extremely long time. In this paper, we propose an efficient graph-based k-mer embedding method, named KMer-Node2Vec, to address this problem. Our method converts a large DNA corpus into a single k-mer co-occurrence graph, then samples random walks on the graph to capture relations between k-mers and learn high-quality k-mer embeddings quickly. Extensive experiments show that our method trains 29 times faster than DNA2Vec on a 4 GB data set and is on par with DNA2Vec in task-specific accuracy for sequence retrieval and classification.
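The graph-and-walk pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "co-occurrence" means one k-mer directly following another when a window slides over the sequence one base at a time, and it uses plain weighted random walks instead of Node2Vec's biased walks. The function names (`kmers`, `build_cooccurrence_graph`, `random_walk`, `generate_walks`) are hypothetical.

```python
import random
from collections import defaultdict


def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride of one base)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_cooccurrence_graph(sequences, k):
    """Build a directed weighted graph over k-mers.

    Assumption: the weight of edge u -> v counts how often k-mer v
    directly follows k-mer u anywhere in the corpus.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = kmers(seq, k)
        for u, v in zip(tokens, tokens[1:]):
            graph[u][v] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, picking successors in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1])
        if not neighbours:
            break  # dead end: this k-mer has no recorded successor
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node that has outgoing edges."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in list(graph):
            walks.append(random_walk(graph, node, length, rng))
    return walks


if __name__ == "__main__":
    g = build_cooccurrence_graph(["ACGTACGT"], k=3)
    for w in generate_walks(g, walks_per_node=1, length=4, seed=42):
        print(" ".join(w))
```

In a Node2Vec-style setup, each walk would then be treated as a "sentence" of k-mer tokens and passed to a skip-gram model to produce the embeddings. The graph is built once from the corpus, and the walk count and walk length are chosen independently of corpus size.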

Matching journals

The top 6 journals account for 50% of the predicted probability mass.
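The cutoff can be checked by accumulating the listed probabilities in rank order until the running total reaches 50%. The helper below is a small sketch with a hypothetical name (`ranks_to_reach`). The input values are the percentages shown for the first seven journals in the ranking.

```python
def ranks_to_reach(probabilities, threshold):
    """Return the 1-based rank at which the running total first reaches threshold."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank
    return None  # threshold never reached


# Percentages of the seven top-ranked journals, in order.
top_probabilities = [23.0, 8.6, 6.5, 6.5, 5.0, 4.3, 4.1]

if __name__ == "__main__":
    print(ranks_to_reach(top_probabilities, 50.0))
```

The first five entries sum to 49.6%, so the sixth journal is the one that pushes the cumulative total past the 50% mark.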

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|------|---------|------------------------|------------|-----------------------|
| 1 | Bioinformatics | 1061 | Top 1% | 23.0% |
| 2 | Genomics, Proteomics & Bioinformatics | 171 | Top 0.7% | 8.6% |
| 3 | Briefings in Bioinformatics | 326 | Top 0.8% | 6.5% |
| 4 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.1% | 6.5% |
| 5 | BMC Bioinformatics | 383 | Top 2% | 5.0% |
| 6 | Nucleic Acids Research | 1128 | Top 4% | 4.3% |
| | 50% of probability mass above | | | |
| 7 | Bioinformatics Advances | 184 | Top 1.0% | 4.1% |
| 8 | Nature Communications | 4913 | Top 38% | 3.8% |
| 9 | Genome Biology | 555 | Top 2% | 3.8% |
| 10 | Advanced Science | 249 | Top 7% | 2.7% |
| 11 | Genome Research | 409 | Top 2% | 2.1% |
| 12 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 | Top 0.2% | 1.9% |
| 13 | iScience | 1063 | Top 14% | 1.7% |
| 14 | PLOS Computational Biology | 1633 | Top 16% | 1.7% |
| 15 | PLOS ONE | 4510 | Top 53% | 1.7% |
| 16 | Journal of Computational Biology | 37 | Top 0.2% | 1.7% |
| 17 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.7% |
| 18 | Computational and Structural Biotechnology Journal | 216 | Top 6% | 1.3% |
| 19 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.9% |
| 20 | Journal of Molecular Biology | 217 | Top 3% | 0.9% |
| 21 | Frontiers in Genetics | 197 | Top 8% | 0.9% |
| 22 | Communications Biology | 886 | Top 20% | 0.8% |
| 23 | Cell Systems | 167 | Top 12% | 0.8% |
| 24 | Nature Biotechnology | 147 | Top 7% | 0.8% |
| 25 | Nature Computational Science | 50 | Top 2% | 0.7% |
| 26 | Scientific Reports | 3102 | Top 75% | 0.7% |
| 27 | Patterns | 70 | Top 3% | 0.7% |
| 28 | Genes | 126 | Top 4% | 0.7% |
| 29 | Journal of Chemical Information and Modeling | 207 | Top 3% | 0.5% |
| 30 | Computers in Biology and Medicine | 120 | Top 6% | 0.5% |