Novel and improved Caenorhabditis briggsae gene models generated by community curation

Moya, N. D.; Stevens, L.; Miller, I. R.; Galindo, J. L.; Bardas, A. D.; Yeo, C.; Rozenich, A. J.; Xu, M.; Koh, E. S. H.; Andersen, E. C.

2023-05-18 genomics
10.1101/2023.05.16.541014 bioRxiv

Background: The nematode Caenorhabditis briggsae has been used as a model for genomics studies compared to Caenorhabditis elegans because of its striking morphological and behavioral similarities. These studies yielded numerous findings that have expanded our understanding of nematode development and evolution. However, the potential of C. briggsae to study nematode biology is limited by the quality of its genome resources. The reference genome and gene models for the C. briggsae laboratory strain AF16 have not been developed to the same extent as C. elegans. The recent publication of a new chromosome-level reference genome for QX1410, a C. briggsae wild strain closely related to AF16, has provided the first step to bridge the gap between C. elegans and C. briggsae genome resources. Currently, the QX1410 gene models consist of protein-coding gene predictions generated from short- and long-read transcriptomic data. Because of the limitations of gene prediction software, the existing gene models for QX1410 contain numerous errors in their structure and coding sequences. In this study, a team of researchers manually inspected over 21,000 software-derived gene models and underlying transcriptomic data to improve the protein-coding gene models of the C. briggsae QX1410 genome.

Results: We designed a detailed workflow to train a team of nine students to manually curate genes using RNA read alignments and predicted gene models. We manually inspected the gene models using the genome annotation editor, Apollo, and proposed corrections to the coding sequences of over 8,000 genes. Additionally, we modeled thousands of putative isoforms and untranslated regions. We exploited the conservation of protein sequence length between C. briggsae and C. elegans to quantify the improvement in protein-coding gene model quality before and after curation. Manual curation led to a substantial improvement in the protein sequence length accuracy of QX1410 genes. We also compared the curated QX1410 gene models against the existing AF16 gene models. The manual curation efforts yielded QX1410 gene models that are similar in quality to the extensively curated AF16 gene models in terms of protein-length accuracy and biological completeness scores. Collinear alignment analysis between the QX1410 and AF16 genomes revealed over 1,800 genes affected by spurious duplications and inversions in the AF16 genome that are now resolved in the QX1410 genome.

Conclusions: Community-based, manual curation using transcriptome data is an effective approach to improve the quality of software-derived protein-coding genes. Comparative genomic analysis using a related species with high-quality reference genome(s) and gene models can be used to quantify improvements in gene model quality in a newly sequenced genome. The detailed protocols provided in this work can be useful for future large-scale manual curation projects in other species. The chromosome-level reference genome for the C. briggsae strain QX1410 far surpasses the quality of the genome of the laboratory strain AF16, and our manual curation efforts have brought the QX1410 gene models to a comparable level of quality to the previous reference, AF16. The improved genome resources for C. briggsae provide reliable tools for the study of Caenorhabditis biology and other related nematodes.
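The protein-length comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes each C. briggsae protein has already been paired with a C. elegans ortholog, and it scores a gene model as accurate when the two lengths differ by no more than a chosen tolerance. The 5% tolerance, the function names, and the example lengths are all hypothetical and are not taken from the paper, whose exact metric may differ.

```python
from typing import List, Tuple


def length_ratio(query_len: int, reference_len: int) -> float:
    """Ratio of a query protein length to its ortholog's length."""
    if reference_len <= 0:
        raise ValueError("reference length must be positive")
    return query_len / reference_len


def fraction_within_tolerance(
    pairs: List[Tuple[int, int]], tolerance: float = 0.05
) -> float:
    """Fraction of (query, reference) pairs whose length ratio is within
    `tolerance` of 1.0. The default tolerance is an assumed value."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for query, ref in pairs
        if abs(length_ratio(query, ref) - 1.0) <= tolerance
    )
    return hits / len(pairs)


# Hypothetical (C. briggsae, C. elegans) protein lengths in amino acids.
before_curation = [(300, 400), (512, 500), (150, 300), (1000, 1010)]
after_curation = [(398, 400), (505, 500), (298, 300), (1000, 1010)]

print(fraction_within_tolerance(before_curation))
print(fraction_within_tolerance(after_curation))
```

Running the same function on gene models before and after curation gives a single comparable number per annotation set, which is the general idea behind using a related, well-annotated species as a yardstick.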

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. G3 Genes|Genomes|Genetics (351 papers in training set, Top 0.1%): 22.1%
2. G3: Genes, Genomes, Genetics (222 papers in training set, Top 0.1%): 14.1%
3. BMC Genomics (328 papers in training set, Top 0.1%): 9.9%
4. Gigabyte (60 papers in training set, Top 0.1%): 9.0%

50% of probability mass above

5. GigaScience (172 papers in training set, Top 0.3%): 4.8%
6. Scientific Reports (3102 papers in training set, Top 39%): 3.5%
7. BMC Bioinformatics (383 papers in training set, Top 3%): 3.5%
8. F1000Research (79 papers in training set, Top 0.9%): 2.3%
9. BMC Biology (248 papers in training set, Top 0.7%): 2.0%
10. Nucleic Acids Research (1128 papers in training set, Top 10%): 1.9%
11. PLOS ONE (4510 papers in training set, Top 52%): 1.8%
12. Molecular Ecology Resources (161 papers in training set, Top 0.6%): 1.7%
13. Database (51 papers in training set, Top 0.4%): 1.6%
14. PLOS Neglected Tropical Diseases (378 papers in training set, Top 4%): 1.5%
15. Genome Biology and Evolution (280 papers in training set, Top 1%): 1.3%
16. Genetics (225 papers in training set, Top 3%): 1.2%
17. Journal of Heredity (35 papers in training set, Top 0.1%): 0.9%
18. Ecology and Evolution (232 papers in training set, Top 4%): 0.9%
19. Bioinformatics Advances (184 papers in training set, Top 5%): 0.7%
20. Bioinformatics (1061 papers in training set, Top 10%): 0.7%
21. Frontiers in Genetics (197 papers in training set, Top 10%): 0.7%
22. PLOS Computational Biology (1633 papers in training set, Top 25%): 0.7%
23. Scientific Data (174 papers in training set, Top 3%): 0.7%
24. Developmental Dynamics (50 papers in training set, Top 0.8%): 0.7%
25. PeerJ (261 papers in training set, Top 18%): 0.6%
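
The position of the 50% marker in the list above can be checked with a short sketch. It assumes the journals are sorted by descending predicted probability and that the marker sits after the smallest number of leading entries whose probabilities sum to at least 50%. The function name is hypothetical, and the probabilities are the first five values from the list.

```python
from typing import List, Optional, Tuple


def smallest_prefix_reaching(
    probabilities: List[float], threshold: float
) -> Optional[Tuple[int, float]]:
    """Return (count, cumulative mass) for the shortest leading run of
    `probabilities` whose sum reaches `threshold`, or None if it never does."""
    total = 0.0
    for count, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= threshold:
            return count, total
    return None


# Predicted probabilities (percent) for the five highest-ranked journals.
top_probabilities = [22.1, 14.1, 9.9, 9.0, 4.8]

count, mass = smallest_prefix_reaching(top_probabilities, 50.0)
print(count, round(mass, 1))
```

Under these assumptions the first three journals sum to 46.1%, so four entries are needed, and together they hold about 55.1% of the probability mass rather than exactly 50%.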