A Convolutional Deep Learning Approach to identify DNA Sequences for Gene Prediction

Motta, J. A.; Gomez, P. D.

2025-10-06 bioinformatics
10.1101/2025.10.03.680292 bioRxiv
In this work, we present a highly efficient machine learning method for identifying DNA sequences that code for genes. The learning process is based on Human Genome Build 38 (GRCh38) sequences extracted from various specialized databases. The sequences were translated into amino acid sequences and used to build matrices from which features were extracted with the TF*IDF metric to create the training space. The prediction functions are learned with a convolutional neural network (CNN) deep learning model. The training spaces were created using the 24 chromosomes of the human genome and approximately 36,000 genes and pseudogenes whose names were retrieved from the HUGO Gene Nomenclature Committee (HGNC). Performance was evaluated on 24 genes associated with genetic disorders, as well as on the surrounding DNA regions. The metrics used were precision, recall, F-score, accuracy, and ROC curves for the genes of interest. The results exceeded our expectations and place the method on par with the state of the art in gene prediction.
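
The feature-extraction step described in the abstract (DNA translated to amino acids, then TF*IDF weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of overlapping amino-acid 2-mers as terms, the plain `tf * log(N / df)` weighting, and the function names `translate`, `kmers`, and `tfidf` are all assumptions made for illustration. Only the standard genetic code is taken as given.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in reading frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def kmers(protein, k=2):
    """Overlapping amino-acid k-mers (assumed term definition, not from the paper)."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        } if total else {})
    return weights


# Two toy sequences (hypothetical, not taken from GRCh38).
sequences = ["ATGGCCGCC", "ATGGCCTGG"]
documents = [kmers(translate(s)) for s in sequences]
features = tfidf(documents)
```

In this toy input, a term shared by both sequences receives weight zero, while terms unique to one sequence receive positive weight. Feature dictionaries of this kind would then need to be arranged into fixed-size matrices before being passed to a CNN; how the authors do that is not stated in the abstract.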

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set, Top 0.4%, 17.4%
2. Scientific Reports: 3102 papers in training set, Top 8%, 9.1%
3. PLOS ONE: 4510 papers in training set, Top 22%, 8.4%
4. IEEE/ACM Transactions on Computational Biology and Bioinformatics: 32 papers in training set, Top 0.1%, 6.8%
5. Bioinformatics: 1061 papers in training set, Top 4%, 6.3%
6. Genes: 126 papers in training set, Top 0.3%, 3.6%

50% of probability mass above

7. Briefings in Bioinformatics: 326 papers in training set, Top 2%, 3.2%
8. BioData Mining: 15 papers in training set, Top 0.1%, 3.1%
9. Frontiers in Genetics: 197 papers in training set, Top 3%, 2.4%
10. Frontiers in Bioinformatics: 45 papers in training set, Top 0.1%, 2.1%
11. Informatics in Medicine Unlocked: 21 papers in training set, Top 0.3%, 2.1%
12. PLOS Computational Biology: 1633 papers in training set, Top 14%, 2.1%
13. Frontiers in Molecular Biosciences: 100 papers in training set, Top 2%, 1.7%
14. BMC Medical Genomics: 36 papers in training set, Top 0.6%, 1.5%
15. Database: 51 papers in training set, Top 0.6%, 1.2%
16. Journal of Bioinformatics and Systems Biology: 14 papers in training set, Top 0.3%, 1.2%
17. F1000Research: 79 papers in training set, Top 3%, 1.2%
18. Journal of Computational Biology: 37 papers in training set, Top 0.4%, 0.9%
19. GigaScience: 172 papers in training set, Top 2%, 0.9%
20. BioMed Research International: 25 papers in training set, Top 2%, 0.9%
21. NAR Genomics and Bioinformatics: 214 papers in training set, Top 3%, 0.9%
22. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.9%
23. Methods: 29 papers in training set, Top 0.4%, 0.9%
24. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 8%, 0.8%
25. Heliyon: 146 papers in training set, Top 6%, 0.8%
26. Biology: 43 papers in training set, Top 3%, 0.7%
27. Biology Methods and Protocols: 53 papers in training set, Top 3%, 0.7%
28. Cancers: 200 papers in training set, Top 5%, 0.7%
29. Biochimie: 23 papers in training set, Top 0.5%, 0.6%
30. International Journal of Molecular Sciences: 453 papers in training set, Top 18%, 0.6%
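
The "top 6 journals account for 50%" statement can be checked by accumulating the listed probabilities in rank order. The short sketch below does this for the first seven entries; the helper name `first_rank_reaching` is hypothetical, and the values are copied from the list above.

```python
# Predicted probabilities (in percent) for ranks 1 to 7, as listed above.
probabilities = [17.4, 9.1, 8.4, 6.8, 6.3, 3.6, 3.2]


def first_rank_reaching(values, threshold):
    """Return the 1-based rank at which the running total first reaches threshold."""
    running_total = 0.0
    for rank, value in enumerate(values, start=1):
        running_total += value
        if running_total >= threshold:
            return rank
    return None  # threshold never reached with the values given


cutoff_rank = first_rank_reaching(probabilities, 50.0)
```

The running total is about 48.0% after five journals and about 51.6% after six, which is consistent with the cut-off being placed after rank 6.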