
geneML: Gene annotation across diverse fungal species using deep learning

Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.

2026-05-21 bioinformatics
10.64898/2026.05.18.725946 bioRxiv

Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML achieved a gene-level F1 score of 67.1, compared with 64.9 for BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Evaluated against Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.

Key Points
- geneML improves gene prediction accuracy over both classical and recent deep learning-based methods, while substantially improving recall.
- geneML predicts alternative transcripts with higher precision and recall than AUGUSTUS, expanding functional annotation.
- Runtime is reduced 32-fold relative to BRAKER3, enabling efficient high-throughput genome annotation.
- geneML identifies novel genes and recovers missing annotations, especially in under-annotated non-Ascomycete genomes.
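The abstract reports F1 at the gene level but only precision and recall at the transcript level, so a short worked calculation may help relate the two. The sketch below is not part of geneML; the function names `f1_score` and `implied_precision` are hypothetical helpers introduced here. It computes a transcript-level F1 from the reported Iso-Seq precision and recall, and back-solves the gene-level precision implied by the reported F1 and recall. The gene-level figures are presumably averages over nine genomes, and an average F1 is not in general the F1 of the average precision and recall, so the back-solved precisions are only a rough consistency check on the "equivalent precision" statement.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision for this F1/recall pair")
    return f1 * recall / denominator


# Transcript-level values quoted in the abstract (percent).
geneml_transcript_f1 = f1_score(precision=71.1, recall=41.1)
augustus_transcript_f1 = f1_score(precision=48.9, recall=33.8)

# Gene-level values quoted in the abstract (percent). Back-solving from
# averaged metrics is an approximation, not a reported result.
geneml_gene_precision = implied_precision(f1=67.1, recall=69.0)
braker3_gene_precision = implied_precision(f1=64.9, recall=64.1)

print(f"transcript F1: geneML {geneml_transcript_f1:.1f}, AUGUSTUS {augustus_transcript_f1:.1f}")
print(f"implied gene-level precision: geneML {geneml_gene_precision:.1f}, BRAKER3 {braker3_gene_precision:.1f}")
```

Under these assumptions the two implied gene-level precisions differ by less than half a percentage point, which is consistent with the gain in F1 coming mainly from recall.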

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 1.0%; 23.3%
2. BMC Bioinformatics: 383 papers in training set; Top 0.2%; 19.3%
3. Genome Biology: 555 papers in training set; Top 1.0%; 6.6%
4. NAR Genomics and Bioinformatics: 214 papers in training set; Top 0.2%; 6.6%

50% of probability mass above

5. Nucleic Acids Research: 1128 papers in training set; Top 4%; 4.4%
6. PLOS Computational Biology: 1633 papers in training set; Top 7%; 4.4%
7. Bioinformatics Advances: 184 papers in training set; Top 1.0%; 4.1%
8. BMC Genomics: 328 papers in training set; Top 0.7%; 3.7%
9. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 3.0%
10. Microbial Genomics: 204 papers in training set; Top 0.7%; 2.8%
11. Cell Reports Methods: 141 papers in training set; Top 3%; 1.5%
12. Genome Research: 409 papers in training set; Top 3%; 1.4%
13. Nature Communications: 4913 papers in training set; Top 56%; 1.3%
14. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 6%; 1.3%
15. GigaScience: 172 papers in training set; Top 2%; 1.3%
16. Nature Methods: 336 papers in training set; Top 5%; 1.0%
17. Scientific Reports: 3102 papers in training set; Top 72%; 0.8%
18. Molecular Ecology Resources: 161 papers in training set; Top 1.0%; 0.8%
19. PLOS ONE: 4510 papers in training set; Top 66%; 0.8%
20. Genome Medicine: 154 papers in training set; Top 8%; 0.8%
21. BMC Biology: 248 papers in training set; Top 6%; 0.5%
22. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 49%; 0.5%
23. G3: Genes, Genomes, Genetics: 222 papers in training set; Top 1%; 0.5%