A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome

Karthik, A. S. P.; Das, A. B.

2026-05-07 bioinformatics
10.64898/2026.05.04.722647 bioRxiv
We developed ExIT (Exon-Intron Transformer), a lightweight codon-based DNA Transformer with multi-head self-attention and an adaptive classifier head. The model uses codon tokenization. It classifies exons versus introns with high accuracy and achieves moderate accuracy in CDS classification and splice site recognition. The model was validated on the human genome, with external validation on the chimpanzee genome. Benchmarking suggests that it outperforms existing models on these tasks.
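The abstract does not say how codon tokenization is implemented. As a rough illustration only, the sketch below assumes non-overlapping 3-mers read from a chosen frame and mapped to integer IDs. The vocabulary layout, the special tokens `[PAD]` and `[UNK]`, and the function names `codon_tokenize` and `encode` are hypothetical and are not taken from ExIT.

```python
from itertools import product

# Hypothetical vocabulary (an assumption, not the ExIT vocabulary):
# two special tokens followed by the 64 codons over the alphabet A, C, G, T.
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + CODONS)}


def codon_tokenize(seq, frame=0):
    """Split a DNA sequence into non-overlapping 3-mers starting at `frame`.

    Trailing bases that do not fill a complete codon are dropped.
    """
    seq = seq.upper()
    usable_end = len(seq) - (len(seq) - frame) % 3
    return [seq[i:i + 3] for i in range(frame, usable_end, 3)]


def encode(seq, frame=0):
    """Map each codon to its integer ID; codons with other symbols (e.g. N) become [UNK]."""
    unk = VOCAB["[UNK]"]
    return [VOCAB.get(codon, unk) for codon in codon_tokenize(seq, frame)]


if __name__ == "__main__":
    print(codon_tokenize("ATGGCCTAAG"))           # frame 0
    print(codon_tokenize("ATGGCCTAAG", frame=1))  # shifted by one base
    print(encode("ATGGCCTAAG"))
```

Compared with single-nucleotide tokens, this scheme shortens the input about threefold, which is one plausible reason a codon-based model can stay lightweight. The choice of reading frame matters, so a real pipeline would need a stated policy for it.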

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Nucleic Acids Research | 1128 | Top 0.1% | 37.9%
2 | Nature Communications | 4913 | Top 18% | 10.1%
3 | BMC Bioinformatics | 383 | Top 2% | 4.3%
(50% of probability mass above)
4 | Bioinformatics | 1061 | Top 5% | 4.0%
5 | Bioinformatics Advances | 184 | Top 1% | 3.6%
6 | Genomics, Proteomics & Bioinformatics | 171 | Top 2% | 3.6%
7 | Briefings in Bioinformatics | 326 | Top 2% | 3.1%
8 | Genome Research | 409 | Top 2% | 2.5%
9 | Frontiers in Genetics | 197 | Top 3% | 2.1%
10 | Communications Biology | 886 | Top 5% | 2.1%
11 | PLOS Computational Biology | 1633 | Top 13% | 2.1%
12 | PLOS ONE | 4510 | Top 50% | 1.9%
13 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.7%
14 | Scientific Reports | 3102 | Top 59% | 1.7%
15 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.3% | 1.5%
16 | Genome Biology | 555 | Top 5% | 1.3%
17 | Database | 51 | Top 0.6% | 1.1%
18 | Advanced Science | 249 | Top 17% | 0.9%
19 | Computational and Structural Biotechnology Journal | 216 | Top 8% | 0.9%
20 | GigaScience | 172 | Top 3% | 0.8%
21 | PLOS Genetics | 756 | Top 14% | 0.8%
22 | Computers in Biology and Medicine | 120 | Top 4% | 0.8%
23 | IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 | Top 0.6% | 0.7%
24 | Journal of Molecular Biology | 217 | Top 4% | 0.6%
25 | Patterns | 70 | Top 3% | 0.6%
26 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.6%
27 | Nature Machine Intelligence | 61 | Top 4% | 0.6%