Significantly Improved Mouse and Rat Genome Annotation Using Sequence Read Archive RNA-seq Data
Meng, F.; Turner, D. L.; Hagenauer, M. H.; Watson, S.; Akil, H.
To detect currently unannotated, lowly expressed genes with high sensitivity and accuracy, we developed a new exon->gene->transcript annotation pipeline that can identify previously undetected multi-exon transcripts from large volumes of RNA-seq data. Our pipeline incorporates three new algorithms: 1) model-based spliced exon detection, 2) exon-to-gene assignment across multiple tissues/datasets through exon community discovery, and 3) ranking of top transcripts by a stepwise minimum flow procedure. The design of our pipeline allowed us to leverage hundreds of Tbases of public RNA-seq data as input to improve mouse and rat genome annotation. Using these data, our pipeline identified close to 15K mouse genes and 21K rat genes that are unannotated in GENCODE M37 and ENSEMBL 114, respectively. Each species also gained over 200K predicted transcripts containing at least one new exon, although most of these were transcripts from GENCODE/ENSEMBL annotated genes with newly assigned exons. To make our genome annotation available for common use, we have packaged it in standard file formats for the analysis of bulk and single-cell RNA-seq data (GTF, 10X genome files). We also provide two usage examples that demonstrate the utility of our newly annotated genes in functional analyses, showing that their expression can be differentially regulated in relation to cell type and selective breeding. Given the efficiency of our pipeline, we expect that, as new RNA-seq data become available in the coming years, the pipeline will significantly benefit rat gene/transcript annotation, eventually enabling us to approach the target of complete gene and transcript annotation.
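As a rough illustration of the exon-to-gene assignment step, the sketch below groups exons that are linked by splice junctions into candidate genes. It is a deliberate simplification and not the authors' algorithm: it uses plain connected components (union-find) as a stand-in for exon community discovery, and the function name `group_exons`, the exon labels, and the junction list are all hypothetical. A real community-discovery method would presumably also weigh junction read support and split weakly linked groups.

```python
from collections import defaultdict


def group_exons(junctions):
    """Group exons into candidate genes.

    Assumption for this sketch: any two exons joined by a splice
    junction belong to the same group (connected components).
    """
    parent = {}

    def find(x):
        # Register unseen exons as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the groups of the two exons on each side of a junction.
    for left, right in junctions:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    # Collect exons under their final root.
    groups = defaultdict(set)
    for exon in list(parent):
        groups[find(exon)].add(exon)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical junctions pooled across tissues/datasets.
junctions = [("e1", "e2"), ("e2", "e3"), ("e4", "e5")]
genes = group_exons(junctions)
```

Here `genes` holds two candidate groups, one with `e1`, `e2`, `e3` and one with `e4`, `e5`. Pooling junctions from many datasets before grouping is what lets a lowly expressed exon, seen in only a few samples, still be attached to a gene.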
Matching journals
The top 3 journals account for 50% of the predicted probability mass.