Significantly Improved Mouse and Rat Genome Annotation Using Sequence Read Archive RNA-seq Data

Meng, F.; Turner, D. L.; Hagenauer, M. H.; Watson, S.; Akil, H.

2026-03-09 genomics
10.64898/2026.03.06.709975 bioRxiv
To detect currently unannotated, lowly expressed genes with high sensitivity and accuracy, we developed a new exon->gene->transcript annotation pipeline that can identify previously undetected multi-exon transcripts using large volumes of RNA-seq data. Our pipeline incorporates three new algorithms: 1) model-based spliced exon detection, 2) exon-to-gene assignment across multiple tissues/datasets through exon community discovery, and 3) ranking top transcripts by a stepwise minimum flow procedure. The design of our pipeline allowed us to leverage hundreds of Tbases of public RNA-seq data as input to improve mouse and rat genome annotation. Using these data, our pipeline identified close to 15K mouse genes and 21K rat genes that are unannotated in GENCODE M37 and ENSEMBL 114, respectively. Each species also gained over 200K predicted transcripts containing at least one new exon, although most were transcripts from GENCODE/ENSEMBL annotated genes with newly assigned exons. To make our genome annotation available for common use, we have packaged it in standard file formats for the analysis of bulk and single-cell RNA-seq data (GTF, 10X genome files). We have also provided two usage examples that demonstrate the utility of our newly annotated genes in functional analyses, showing that their expression can be differentially regulated in relation to cell type and selective breeding. Given the efficiency of our pipeline, we expect that it will significantly benefit rat gene/transcript annotation as new RNA-seq data become available in the coming years, eventually enabling us to approach the target of complete gene and transcript annotation.
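
Because the annotation is distributed as GTF, a quick way to inspect such a file is to count transcripts per gene from its exon records. The following is a minimal sketch under stated assumptions, not the authors' tooling: the records in `SAMPLE_GTF`, the IDs `geneA` and `geneB`, and the function name `transcripts_per_gene` are all hypothetical, and only the generic nine-column GTF layout with `gene_id` and `transcript_id` attributes is assumed.

```python
import re
from collections import defaultdict

# Hypothetical GTF exon records for illustration only; these IDs and
# coordinates are invented and do not come from the released annotation.
SAMPLE_GTF = "\n".join([
    "\t".join(["chr1", "example", "exon", "100", "200", ".", "+", ".",
               'gene_id "geneA"; transcript_id "geneA.1";']),
    "\t".join(["chr1", "example", "exon", "300", "400", ".", "+", ".",
               'gene_id "geneA"; transcript_id "geneA.1";']),
    "\t".join(["chr1", "example", "exon", "100", "250", ".", "+", ".",
               'gene_id "geneA"; transcript_id "geneA.2";']),
    "\t".join(["chr2", "example", "exon", "500", "650", ".", "-", ".",
               'gene_id "geneB"; transcript_id "geneB.1";']),
])

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def transcripts_per_gene(gtf_text):
    """Count distinct transcript IDs per gene ID across exon records."""
    genes = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # keep only well-formed exon records
        attrs = dict(ATTR_RE.findall(fields[8]))
        genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return {gene: len(ids) for gene, ids in genes.items()}


counts = transcripts_per_gene(SAMPLE_GTF)
print(counts)
```

For a real file, the same function could be applied to the text read from disk; gene-level and transcript-level records are skipped here because only exon lines are counted.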

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. BMC Genomics (328 papers in training set; Top 0.1%; 33.4%)
2. Genome Research (409 papers in training set; Top 0.1%; 14.9%)
3. NAR Genomics and Bioinformatics (214 papers in training set; Top 0.3%; 4.9%)
50% of probability mass above
4. Frontiers in Genetics (197 papers in training set; Top 1%; 4.9%)
5. Nucleic Acids Research (1128 papers in training set; Top 4%; 4.6%)
6. Genome Biology (555 papers in training set; Top 3%; 3.1%)
7. G3 Genes|Genomes|Genetics (351 papers in training set; Top 0.9%; 2.6%)
8. Nature Communications (4913 papers in training set; Top 46%; 2.1%)
9. GigaScience (172 papers in training set; Top 0.9%; 2.1%)
10. Bioinformatics Advances (184 papers in training set; Top 2%; 1.9%)
11. PLOS ONE (4510 papers in training set; Top 52%; 1.8%)
12. G3: Genes, Genomes, Genetics (222 papers in training set; Top 0.4%; 1.7%)
13. Genomics (60 papers in training set; Top 1%; 1.5%)
14. Database (51 papers in training set; Top 0.5%; 1.5%)
15. Scientific Reports (3102 papers in training set; Top 62%; 1.5%)
16. Bioinformatics (1061 papers in training set; Top 8%; 1.2%)
17. PLOS Genetics (756 papers in training set; Top 11%; 1.2%)
18. BMC Bioinformatics (383 papers in training set; Top 6%; 0.8%)
19. Cell Genomics (162 papers in training set; Top 6%; 0.8%)
20. Genome Medicine (154 papers in training set; Top 8%; 0.8%)
21. Genetics (225 papers in training set; Top 4%; 0.7%)
22. Communications Biology (886 papers in training set; Top 28%; 0.7%)
23. Development (440 papers in training set; Top 4%; 0.7%)
24. Genes (126 papers in training set; Top 4%; 0.7%)
25. Nature Methods (336 papers in training set; Top 7%; 0.5%)
26. Biology Methods and Protocols (53 papers in training set; Top 4%; 0.5%)
27. Journal of Genetics and Genomics (36 papers in training set; Top 3%; 0.5%)
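
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked with a running total over the listed percentages. The sketch below copies the 27 displayed values in rank order; the function name `journals_to_reach` is an assumption made for illustration, and because the displayed values are rounded, the totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities in percent, copied from the ranked list in rank order.
probabilities = [
    33.4, 14.9, 4.9, 4.9, 4.6, 3.1, 2.6, 2.1, 2.1, 1.9, 1.8, 1.7, 1.5, 1.5,
    1.5, 1.2, 1.2, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.5, 0.5, 0.5,
]


def journals_to_reach(probs, threshold):
    """Return the smallest rank whose running total reaches the threshold.

    Returns None if the listed values never reach it.
    """
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold - 1e-9:  # small tolerance for float rounding
            return rank
    return None


top_k = journals_to_reach(probabilities, 50.0)
top_k_mass = sum(probabilities[:top_k])
listed_mass = sum(probabilities)
print(f"{top_k} journals reach 50% (running total {top_k_mass:.1f}%)")
print(f"All {len(probabilities)} listed journals cover {listed_mass:.1f}%")
```

The running total passes 50% at the third entry, which matches the position of the "50% of probability mass above" marker in the list.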