geneML: Gene annotation across diverse fungal species using deep learning
Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.
Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML achieved a gene-level F1 score of 67.1, compared with 64.9 for BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Evaluated against Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.

Key Points
- geneML improves gene prediction accuracy over both classical and recent deep learning-based methods, with substantially higher recall.
- geneML predicts alternative transcripts with higher precision and recall than AUGUSTUS, expanding functional annotation.
- Runtime is 32-fold lower than that of BRAKER3, enabling efficient high-throughput genome annotation.
- geneML identifies novel genes and recovers missing annotations, especially in under-annotated non-Ascomycete genomes.
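The abstract reports gene-level F1 and recall but describes precision only as "equivalent", and it reports transcript-level precision and recall without an F1. The sketch below shows how these quantities relate through the standard definition F1 = 2PR / (P + R). It assumes the reported figures can be treated as a single precision/recall pair per tool; if the authors averaged F1 per genome, the back-calculated values are only approximate. The function names are illustrative and are not part of geneML.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Gene-level figures from the abstract: (F1, recall), in percent.
gene_level = {"geneML": (67.1, 69.0), "BRAKER3": (64.9, 64.1)}

# Transcript-level figures from the abstract: (precision, recall), in percent.
transcript_level = {"geneML": (71.1, 41.1), "AUGUSTUS": (48.9, 33.8)}

for tool, (f1, recall) in gene_level.items():
    # Assumption: F1 was computed from one aggregate precision/recall pair.
    print(f"{tool}: implied gene-level precision = {implied_precision(f1, recall):.1f}")

for tool, (precision, recall) in transcript_level.items():
    print(f"{tool}: transcript-level F1 = {f1_score(precision, recall):.1f}")
```

Under that assumption, the implied gene-level precision is about 65.3 for geneML and 65.7 for BRAKER3, which agrees with the "equivalent precision" statement, and the transcript-level F1 works out to roughly 52.1 for geneML versus 40.0 for AUGUSTUS.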