
An expanded reference catalog of translated open reading frames for biomedical research

Chothani, S.; Ruiz-Orera, J.; Tierney, J. A. S.; Clauwaert, J.; Deutsch, E. W.; Alba, M. M.; Aspden, J. L.; Baranov, P. V.; Bazzini, A. A.; Bruford, E. A.; Brunet, M. A.; Cardon, T.; Carvunis, A.-R.; Casola, C.; Choudhary, J. S.; Dean, K.; Faridi, P.; Fierro-Monti, I.; Fournier, I.; Frankish, A.; Gerstein, M.; Hubner, N.; Jiang, Y.; Kellis, M.; Kok, L. W.; Martinez, T. F.; Menschaert, G.; Ni, P.; Orchard, S.; Roucou, X.; Rozowsky, J.; Salzet, M.; Siragusa, M.; Slavoff, S.; Swirski, M. I.; Valen, E.; Vizcaino, J. A.; Wacholder, A.; Wu, W.; Xie, Z.; Yang, Y. T.; Moritz, R. L.; Mudge, J.; van Hee

2025-07-07 genomics
10.1101/2025.07.03.662928 bioRxiv

Non-canonical (i.e., unannotated) open reading frames (ncORFs) have until recently been omitted from reference genome annotations, despite evidence of their translation, limiting their incorporation into biomedical research. To address this, in 2022, we initiated the TransCODE consortium and built the first community-driven consensus catalog of human ncORFs, which was openly distributed to the research community via Ensembl-GENCODE. While this catalog represented a starting point for reference ncORF annotation, major technical and scientific issues remained. In particular, this initial catalog had no standardized framework to judge the evidence of translation for individual ncORFs. Here, we present an expanded and refined catalog of the human reference annotation of ncORFs. By incorporating more datasets and by lifting constraints on ORF length and start codon, we define a comprehensive set of 28,359 ncORFs that is nearly four times the size of the previous catalog. Furthermore, to aid users who wish to work with the ncORFs that have the strongest and most reproducible signals of translation, we used a data-driven framework (i.e., translation signature scores) to assess the accumulated evidence for any individual ncORF. Using this approach, we derive a subset of 7,888 ncORFs with translation evidence on par with canonical protein-coding genes, which we refer to as the Primary set. This set places particular emphasis on high quality and can serve as a reliable reference for downstream analyses and validation. Overall, this update reflects continual community-driven efforts to make ncORFs accessible and actionable to the broader research public, and further iterations of the catalog will continue to expand and refine this resource.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Database: 51 papers in training set, Top 0.1%, 26.7%
2. Nucleic Acids Research: 1128 papers in training set, Top 1%, 10.8%
3. Scientific Data: 174 papers in training set, Top 0.2%, 8.7%
4. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 0.5%, 6.6%

50% of probability mass above

5. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 2%, 3.7%
6. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.6%, 3.7%
7. Bioinformatics: 1061 papers in training set, Top 6%, 3.2%
8. Scientific Reports: 3102 papers in training set, Top 44%, 2.7%
9. Frontiers in Genetics: 197 papers in training set, Top 4%, 1.9%
10. Bioinformatics Advances: 184 papers in training set, Top 2%, 1.9%
11. PLOS Computational Biology: 1633 papers in training set, Top 15%, 1.7%
12. Cell Genomics: 162 papers in training set, Top 3%, 1.7%
13. Genomics: 60 papers in training set, Top 0.9%, 1.7%
14. Genome Biology: 555 papers in training set, Top 5%, 1.5%
15. Life Science Alliance: 263 papers in training set, Top 0.4%, 1.5%
16. International Journal of Molecular Sciences: 453 papers in training set, Top 9%, 1.4%
17. BMC Bioinformatics: 383 papers in training set, Top 5%, 1.4%
18. PLOS ONE: 4510 papers in training set, Top 60%, 1.3%
19. Genome Medicine: 154 papers in training set, Top 6%, 1.0%
20. Genes: 126 papers in training set, Top 2%, 0.9%
21. Nature Communications: 4913 papers in training set, Top 60%, 0.8%
22. Journal of Genetics and Genomics: 36 papers in training set, Top 2%, 0.8%
23. BMC Biology: 248 papers in training set, Top 4%, 0.7%
24. Briefings in Bioinformatics: 326 papers in training set, Top 7%, 0.7%
25. DNA Research: 23 papers in training set, Top 0.6%, 0.7%
26. GigaScience: 172 papers in training set, Top 4%, 0.7%
27. The Plant Journal: 197 papers in training set, Top 4%, 0.5%
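
The 50% cutoff in the list above can be checked with a short cumulative sum. The sketch below is illustrative only: it assumes the cutoff is the smallest number of top-ranked journals whose predicted probabilities sum to at least 50%, which the page does not state explicitly, and the function name `journals_needed` is hypothetical. The probabilities are copied from the first eight entries of the list.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for the first eight ranked journals,
# copied from the list above.
probabilities = [26.7, 10.8, 8.7, 6.6, 3.7, 3.7, 3.2, 2.7]


def journals_needed(probs, threshold=50.0):
    """Return how many top-ranked entries are needed for the running total
    to reach the threshold, or None if the threshold is never reached.

    Assumption: the cutoff is defined on the cumulative sum in rank order.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


top_n = journals_needed(probabilities)
print(top_n)
```

Under this assumption the first three journals sum to 46.2% and the fourth brings the total to 52.8%, which agrees with the statement that the top 4 journals account for 50% of the predicted probability mass.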