SpliceRead: Improving Canonical and Non-Canonical Splice Site Prediction with Residual Blocks and Synthetic Data Augmentation

Thapa, S.; Samderiya, K.; Menon, R.; Oluwadare, O.

2026-02-09 bioinformatics
10.64898/2026.02.05.703825 bioRxiv

Accurate splice site prediction is fundamental to understanding gene expression and its associated disorders. However, most existing models are biased toward frequent canonical sites, limiting their ability to detect rare but biologically important non-canonical variants. These models often rely heavily on large, imbalanced datasets that fail to capture the sequence diversity of non-canonical sites, leading to high false-negative rates. Here, we present SpliceRead, a novel deep learning model designed to improve the classification of both canonical and non-canonical splice sites using a combination of residual convolutional blocks and synthetic data augmentation. SpliceRead employs a data augmentation method to generate diverse non-canonical sequences and uses residual connections to enhance gradient flow and capture subtle genomic features. Trained and tested on a multi-species dataset of 400- and 600-nucleotide sequences, SpliceRead consistently outperforms state-of-the-art models across all key metrics, including F1-score, accuracy, precision, and recall. Notably, it achieves a substantially lower non-canonical misclassification rate than baseline methods. Extensive evaluations, including cross-validation, cross-species testing, and input-length generalization, confirm its robustness and adaptability. SpliceRead offers a powerful, generalizable framework for splice site prediction, particularly in challenging, low-frequency sequence scenarios, and paves the way for more accurate gene annotation in both model and non-model organisms. The open-sourced code of SpliceRead and detailed documentation are available at https://github.com/OluwadareLab/SpliceRead.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 2%; 18.0%
2. Genome Biology; 555 papers in training set; Top 0.1%; 18.0%
3. Nucleic Acids Research; 1128 papers in training set; Top 3%; 6.6%
4. Nature Methods; 336 papers in training set; Top 2%; 6.1%
5. Nature Communications; 4913 papers in training set; Top 31%; 6.1%

50% of probability mass above

6. Nature Biotechnology; 147 papers in training set; Top 2%; 6.1%
7. Nature Machine Intelligence; 61 papers in training set; Top 1%; 3.5%
8. Briefings in Bioinformatics; 326 papers in training set; Top 2%; 3.5%
9. The American Journal of Human Genetics; 206 papers in training set; Top 1%; 3.0%
10. Genome Research; 409 papers in training set; Top 1%; 2.6%
11. Cell Systems; 167 papers in training set; Top 5%; 2.6%
12. Bioinformatics Advances; 184 papers in training set; Top 2%; 2.0%
13. NAR Genomics and Bioinformatics; 214 papers in training set; Top 2%; 1.8%
14. Genome Medicine; 154 papers in training set; Top 5%; 1.6%
15. BMC Bioinformatics; 383 papers in training set; Top 5%; 1.6%
16. Cell Genomics; 162 papers in training set; Top 4%; 1.4%
17. Science; 429 papers in training set; Top 16%; 1.3%
18. Advanced Science; 249 papers in training set; Top 15%; 1.2%
19. PLOS Computational Biology; 1633 papers in training set; Top 20%; 1.2%
20. Nature Computational Science; 50 papers in training set; Top 1%; 0.9%
21. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 5%; 0.9%
22. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 42%; 0.9%
23. Nature; 575 papers in training set; Top 16%; 0.7%
24. PLOS ONE; 4510 papers in training set; Top 69%; 0.7%