Benchmarking foundation models for splice site and exon annotation

He, Z.; Florea, L.

2026-02-23 bioinformatics
10.64898/2026.02.22.707219 bioRxiv
Recent foundation and deep learning models have brought a generational leap in the quality of genome annotation, particularly in identifying genes and their structural elements, including exons and splice sites. However, they are trained on reduced datasets that may not capture biological complexity, such as the differences between coding and non-coding, terminal and internal, constitutive and alternatively spliced, and transposable element (TE)-derived exons. We evaluate several foundation models for gene and splice site annotation on these classes of gene elements: the transformer-based SegmentNT, Enformer, and Borzoi, coupled with a segmentation head for per-base resolution; the CNN-based SpliceAI and AlphaGenome; and a newly developed fine-tuned model, STEP2h. We found that the performance of all methods is highest for the classes of exons represented in their training data and decreases drastically for poorly represented classes. In particular, performance is highest for protein-coding genes, coding exons, and constitutive exons, and drops by as much as 2- to 4-fold for non-coding internal exons, terminal exons, and exons that undergo alternative splicing. Performance is similarly impaired on LINE-1- and Alu-derived exons. In contrast, a locally developed CNN model fine-tuned on a specialized TE-exon dataset showed improved performance in this category. Our study highlights the outstanding challenges in gene and exon annotation when leveraging powerful foundation models, and the need for further fine-tuning on judiciously selected classes of data, or for task-specific models, to capture a broader, more diverse spectrum of gene features.
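
The class-stratified comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: the record layout, the class names, the choice of F1 as the metric, and the toy counts are all assumptions made for illustration. It scores predictions separately per exon class and expresses each class as a fold drop relative to a reference class.

```python
from collections import defaultdict


def per_class_f1(records):
    """Compute F1 per exon class.

    records: iterable of (exon_class, is_true_exon, is_predicted) tuples.
    This layout is hypothetical, chosen only for the sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for exon_class, truth, pred in records:
        c = counts[exon_class]
        if truth and pred:
            c["tp"] += 1
        elif pred:
            c["fp"] += 1
        elif truth:
            c["fn"] += 1
    scores = {}
    for exon_class, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[exon_class] = 2 * c["tp"] / denom if denom else 0.0
    return scores


def fold_drop(scores, reference):
    """Ratio of the reference class score to each class score."""
    ref = scores[reference]
    return {k: (ref / v if v > 0 else float("inf")) for k, v in scores.items()}


# Toy counts, invented for illustration (not results from the study).
toy = (
    [("coding_internal", True, True)] * 8
    + [("coding_internal", False, True)] * 1
    + [("coding_internal", True, False)] * 1
    + [("noncoding_internal", True, True)] * 2
    + [("noncoding_internal", False, True)] * 2
    + [("noncoding_internal", True, False)] * 4
)

scores = per_class_f1(toy)
drops = fold_drop(scores, reference="coding_internal")
for name in scores:
    print(f"{name}: F1={scores[name]:.3f}, fold drop={drops[name]:.2f}")
```

Reporting a ratio against the best-represented class, rather than a single pooled score, is what makes a gap such as the 2- to 4-fold decrease visible: pooled metrics are dominated by the most abundant class.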

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | NAR Genomics and Bioinformatics | 214 | Top 0.1% | 12.4%
2 | Bioinformatics | 1061 | Top 2% | 12.1%
3 | Nucleic Acids Research | 1128 | Top 1% | 10.2%
4 | Bioinformatics Advances | 184 | Top 0.5% | 6.2%
5 | PLOS Computational Biology | 1633 | Top 6% | 6.2%
6 | Genome Biology | 555 | Top 1% | 6.2%
50% of probability mass above
7 | Genome Research | 409 | Top 0.7% | 4.2%
8 | Nature Communications | 4913 | Top 41% | 3.5%
9 | Briefings in Bioinformatics | 326 | Top 2% | 3.5%
10 | BMC Bioinformatics | 383 | Top 3% | 3.0%
11 | GigaScience | 172 | Top 0.7% | 2.7%
12 | PLOS Genetics | 756 | Top 6% | 2.7%
13 | Frontiers in Genetics | 197 | Top 3% | 2.4%
14 | BMC Genomics | 328 | Top 2% | 2.1%
15 | Computational and Structural Biotechnology Journal | 216 | Top 3% | 2.0%
16 | Nature Methods | 336 | Top 4% | 1.8%
17 | Cell Genomics | 162 | Top 3% | 1.7%
18 | PLOS ONE | 4510 | Top 59% | 1.3%
19 | Cell Systems | 167 | Top 11% | 0.9%
20 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.5% | 0.9%
21 | Genome Medicine | 154 | Top 7% | 0.9%
22 | Nature Biotechnology | 147 | Top 8% | 0.7%
23 | Nature Machine Intelligence | 61 | Top 4% | 0.7%
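
The 50% cutoff stated above can be checked against the listed probabilities. The sketch below assumes the cutoff is the smallest number of top-ranked journals whose cumulative probability reaches the threshold; the function name is hypothetical, and the values are the first ten percentages from the list.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at which the running total of ranked
    probabilities (in percent) first reaches the threshold."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total  # threshold never reached


# Predicted probabilities (%) for ranks 1-10, copied from the list.
listed = [12.4, 12.1, 10.2, 6.2, 6.2, 6.2, 4.2, 3.5, 3.5, 3.0]

rank, mass = journals_to_reach(listed)
print(f"threshold reached at rank {rank} with cumulative mass {mass:.1f}%")
```

With these values the running total is 47.1% after five journals and 53.3% after six, which agrees with the stated cutoff at rank 6.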