FastFeatGen: Faster parallel feature extraction from genome sequences and efficient prediction of DNA N6-methyladenine sites

Rahman, M. K.

2019-11-18 bioinformatics
10.1101/846311 bioRxiv
N6-methyladenine is widely found in both prokaryotes and eukaryotes. It is involved in many biological processes, including prokaryotic defense systems and human diseases, so it is important to locate it correctly in the genome, where it may play a significant role in different biological functions. A few computational tools exist for this purpose, but they are computationally expensive and leave room to improve accuracy. An informative feature extraction pipeline from genome sequences is at the heart of these tools, as well as of many other bioinformatics tools, but it becomes expensive for sequential approaches when the data are large. Hence, a scalable parallel approach is highly desirable. In this paper, we develop a new tool, called FastFeatGen, that emphasizes both a parallel feature extraction technique and improved accuracy using machine learning methods. We implement our feature extraction approach using shared-memory parallelism, which achieves around a 10x speedup over the sequential version. We then employ an exploratory feature selection technique that helps find more relevant features to feed to machine learning methods. We employ an Extra-Tree Classifier (ETC) in FastFeatGen and perform experiments on rice and mouse genomes. Our experimental results achieve accuracies of 85.57% and 96.64%, respectively, which are better than or competitive with current state-of-the-art methods. Our shared-memory tool can also serve queries much faster than the sequential technique. All source code and datasets are available at https://github.com/khaled-rahman/FastFeatGen.
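To make the idea of parallel feature extraction concrete, here is a minimal sketch in Python. It is not the FastFeatGen implementation: the abstract does not say which features the tool computes, so normalized k-mer frequencies are an assumption chosen for illustration, and the function names `kmer_features`, `extract_parallel`, and `extract_sequential` are hypothetical. Threads stand in for the shared-memory model; in CPython the global interpreter lock would limit the real speedup, so the sketch shows the structure of the approach rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

ALPHABET = "ACGT"


def kmer_features(seq, k=2):
    """Return normalized k-mer frequencies of seq in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = max(windows, 1)  # avoid division by zero for very short sequences
    return [counts[m] / total for m in kmers]


def extract_sequential(sequences, k=2):
    """Baseline: extract features one sequence at a time."""
    return [kmer_features(s, k) for s in sequences]


def extract_parallel(sequences, k=2, workers=4):
    """Extract features for each sequence independently across worker threads.

    Each sequence is processed without touching the others, so the work
    splits cleanly; pool.map preserves the input order of the results.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: kmer_features(s, k), sequences))


if __name__ == "__main__":
    seqs = ["ACGTACGT", "AAAA", "GGCC"]
    features = extract_parallel(seqs)
    print(len(features), len(features[0]))
```

Because every sequence is handled independently, the parallel and sequential versions return identical feature vectors; only the scheduling differs.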

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 0.9% | 23.5%
2 | BMC Bioinformatics | 383 papers in training set | Top 0.2% | 19.5%
3 | PLOS ONE | 4510 papers in training set | Top 26% | 6.6%
4 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.1% | 6.6%
50% of probability mass above
5 | PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.8%
6 | Bioinformatics Advances | 184 papers in training set | Top 1% | 3.7%
7 | Briefings in Bioinformatics | 326 papers in training set | Top 2% | 3.2%
8 | GigaScience | 172 papers in training set | Top 0.8% | 2.6%
9 | IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.2% | 1.8%
10 | Frontiers in Bioinformatics | 45 papers in training set | Top 0.2% | 1.8%
11 | BioData Mining | 15 papers in training set | Top 0.3% | 1.7%
12 | Gigabyte | 60 papers in training set | Top 0.7% | 1.5%
13 | Journal of Computational Biology | 37 papers in training set | Top 0.3% | 1.4%
14 | Frontiers in Genetics | 197 papers in training set | Top 6% | 1.3%
15 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 1.0%
16 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.0%
17 | PeerJ | 261 papers in training set | Top 12% | 0.9%
18 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 8% | 0.8%
19 | Scientific Reports | 3102 papers in training set | Top 72% | 0.8%
20 | Genome Biology | 555 papers in training set | Top 7% | 0.8%
21 | Journal of Bioinformatics and Systems Biology | 14 papers in training set | Top 0.6% | 0.8%
22 | BMC Genomics | 328 papers in training set | Top 6% | 0.7%
23 | F1000Research | 79 papers in training set | Top 5% | 0.7%
24 | Journal of Proteome Research | 215 papers in training set | Top 2% | 0.7%
25 | Genomics | 60 papers in training set | Top 3% | 0.5%
26 | Journal of Chemical Information and Modeling | 207 papers in training set | Top 3% | 0.5%
27 | BMC Medical Genomics | 36 papers in training set | Top 2% | 0.5%
28 | Journal of Molecular Biology | 217 papers in training set | Top 5% | 0.5%
29 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 7% | 0.5%
30 | iScience | 1063 papers in training set | Top 39% | 0.5%
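
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The short sketch below does that for the first five entries; the function name `journals_for_mass` is hypothetical and introduced only for this illustration.

```python
def journals_for_mass(probabilities, threshold=50.0):
    """Return how many top-ranked entries are needed to reach `threshold` percent.

    `probabilities` must be given in rank order (highest first), in percent.
    Returns None if the threshold is never reached.
    """
    cumulative = 0.0
    for rank, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= threshold:
            return rank
    return None


# Predicted probabilities (percent) of the five top-ranked journals listed above.
top_probs = [23.5, 19.5, 6.6, 6.6, 3.8]
print(journals_for_mass(top_probs))  # prints 4
```

The first three journals sum to 49.6%, just under the threshold, and the fourth brings the total to 56.2%, which is why the divider sits after rank 4.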