Lessons learned from manual curation of thousands of gene models in the nematode Pristionchus pacificus

Roedelsperger, C.; Agyal, N.; Quiobe, S. P.; Wu, H.; Ibarra-Morales, D.; Sommer, R. J.

2026-02-19 genomics
10.64898/2026.02.18.706511 bioRxiv

Continuous developments in sequencing technologies have led to the generation of chromosome-scale genome assemblies across the whole tree of life, but our ability to annotate genomes has lagged behind. One major problem is that typically not all genes are expressed at detectable levels at any given life stage or in any given environment. Therefore, available transcriptome data need to be complemented by gene prediction programs and protein homology evidence. However, how to optimally combine these different data types is not well understood. Here, we present a case study in which we community-curated gene annotations of the Pristionchus pacificus strain RSC011. By incorporating new Iso-seq and RNA-seq data and genome-wide screening, we identified and corrected more than 7,500 (~24%) gene models. While the improved gene annotation for the RSC011 strain will be useful for the P. pacificus community, our study reveals several gene annotation problems that may affect data from other species. Among these, we identified assembly errors, artificial transcript fusions resulting from overlapping genes and polycistronic RNAs, falsely called open reading frames, and error propagation based on homology data as frequent sources of gene annotation errors. Thus, our findings may be helpful in guiding future efforts to annotate genomes across different taxonomic groups.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. BMC Genomics (328 papers in training set): Top 0.1%, 14.5%
2. Gigabyte (60 papers in training set): Top 0.1%, 12.5%
3. G3 Genes|Genomes|Genetics (351 papers in training set): Top 0.3%, 6.7%
4. DNA Research (23 papers in training set): Top 0.1%, 3.6%
5. Scientific Reports (3102 papers in training set): Top 36%, 3.6%
6. GigaScience (172 papers in training set): Top 0.5%, 3.5%
7. PeerJ (261 papers in training set): Top 3%, 3.5%
8. G3: Genes, Genomes, Genetics (222 papers in training set): Top 0.2%, 3.0%

50% of probability mass above

9. Scientific Data (174 papers in training set): Top 0.6%, 3.0%
10. NAR Genomics and Bioinformatics (214 papers in training set): Top 1%, 2.7%
11. Molecular Ecology Resources (161 papers in training set): Top 0.4%, 2.6%
12. Frontiers in Genetics (197 papers in training set): Top 3%, 2.3%
13. Genome Biology and Evolution (280 papers in training set): Top 0.8%, 2.1%
14. Peer Community Journal (254 papers in training set): Top 1%, 2.1%
15. Frontiers in Marine Science (55 papers in training set): Top 0.5%, 2.1%
16. Aquaculture (29 papers in training set): Top 0.3%, 1.9%
17. PLOS ONE (4510 papers in training set): Top 51%, 1.9%
18. F1000Research (79 papers in training set): Top 2%, 1.7%
19. Microbiology Resource Announcements (22 papers in training set): Top 0.5%, 1.3%
20. Journal of Heredity (35 papers in training set): Top 0.1%, 1.3%
21. BMC Bioinformatics (383 papers in training set): Top 6%, 1.2%
22. BMC Biology (248 papers in training set): Top 2%, 1.1%
23. mSystems (361 papers in training set): Top 6%, 1.1%
24. Journal of Molecular Evolution (21 papers in training set): Top 0.3%, 0.9%
25. Microbial Genomics (204 papers in training set): Top 2%, 0.9%
26. PLOS Computational Biology (1633 papers in training set): Top 22%, 0.9%
27. Genomics (60 papers in training set): Top 2%, 0.8%
28. Frontiers in Microbiology (375 papers in training set): Top 9%, 0.7%
29. Ecology and Evolution (232 papers in training set): Top 4%, 0.7%
30. PLOS Neglected Tropical Diseases (378 papers in training set): Top 5%, 0.7%