More than 2,500 coding genes in the human reference gene set still have unsettled status

Maquedano, M.; Cerdan-Velez, D.; Tress, M. L.

2024-12-09 genomics
10.1101/2024.12.05.626965 bioRxiv

In 2018 we analysed the three main repositories for the human proteome, Ensembl/GENCODE, RefSeq and UniProtKB. They disagreed on the coding status of one of every eight annotated coding genes. The analysis inspired bilateral collaborations between annotation groups. Here we have repeated our analysis with updated versions of the three reference coding gene sets. Superficially, little appears to have changed. Although there are slightly fewer genes predicted as coding overall, the three groups still disagree on the status of 2,606 annotated genes. However, a comparison without read-through genes and immunoglobulin fragments shows that the three reference sets have merged or reclassified more than 700 genes since the last analysis and that just 0.6% of Ensembl/GENCODE coding genes are not also annotated by the other two reference sets. We used eight features indicative of non-coding genes to examine the 21,873 coding genes annotated across the three reference sets. We found that more than 2,000 had one or more potential non-coding features. While some of these genes will be protein coding, we believe that most are likely to be non-coding genes or pseudogenes. Our results suggest that annotators still vastly overestimate the number of true coding genes.
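The kind of three-way comparison the abstract describes can be illustrated with plain set operations. This is only a minimal sketch, not the authors' pipeline: it assumes the genes in each reference set have already been mapped to a shared identifier (in practice the hard part), and the function name `coding_status_disagreement` and the toy gene labels are hypothetical.

```python
def coding_status_disagreement(ensembl, refseq, uniprot):
    """Count genes on which three coding gene sets agree or disagree.

    Each argument is a set of gene identifiers annotated as coding by one
    reference set. Identifiers are assumed to be already harmonised.
    """
    all_genes = ensembl | refseq | uniprot   # annotated as coding by at least one set
    agreed = ensembl & refseq & uniprot      # annotated as coding by all three sets
    disputed = all_genes - agreed            # coding status differs between sets
    return {
        "total": len(all_genes),
        "agreed": len(agreed),
        "disputed": len(disputed),
        "fraction_disputed": len(disputed) / len(all_genes) if all_genes else 0.0,
    }


# Toy, made-up identifiers purely for illustration.
ensembl = {"G1", "G2", "G3", "G4"}
refseq = {"G1", "G2", "G3", "G5"}
uniprot = {"G1", "G2", "G4", "G6"}

summary = coding_status_disagreement(ensembl, refseq, uniprot)
print(summary)
```

With real data, the "disputed" count would correspond to the genes whose coding status the three groups disagree on, subject to how read-through genes and immunoglobulin fragments are handled.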

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. F1000Research; 79 papers in training set; Top 0.1%; 14.7%
2. Database; 51 papers in training set; Top 0.1%; 14.3%
3. Frontiers in Genetics; 197 papers in training set; Top 0.7%; 6.8%
4. PLOS ONE; 4510 papers in training set; Top 35%; 4.2%
5. Nucleic Acids Research; 1128 papers in training set; Top 5%; 4.0%
6. NAR Genomics and Bioinformatics; 214 papers in training set; Top 0.7%; 3.6%
7. Scientific Reports; 3102 papers in training set; Top 37%; 3.6%

50% of probability mass above

8. Journal of Molecular Evolution; 21 papers in training set; Top 0.1%; 3.6%
9. Genome Biology and Evolution; 280 papers in training set; Top 0.4%; 3.6%
10. Genome Research; 409 papers in training set; Top 1%; 2.9%
11. PeerJ; 261 papers in training set; Top 7%; 1.7%
12. PLOS Computational Biology; 1633 papers in training set; Top 16%; 1.7%
13. Genes; 126 papers in training set; Top 1%; 1.5%
14. Peer Community Journal; 254 papers in training set; Top 3%; 1.2%
15. BioData Mining; 15 papers in training set; Top 0.5%; 1.2%
16. GigaScience; 172 papers in training set; Top 2%; 1.2%
17. Heliyon; 146 papers in training set; Top 3%; 1.2%
18. G3 Genes|Genomes|Genetics; 351 papers in training set; Top 2%; 1.1%
19. PLOS Genetics; 756 papers in training set; Top 13%; 0.9%
20. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 8%; 0.9%
21. International Journal of Molecular Sciences; 453 papers in training set; Top 13%; 0.9%
22. Bioinformatics Advances; 184 papers in training set; Top 4%; 0.8%
23. BMC Medical Genomics; 36 papers in training set; Top 1%; 0.8%
24. G3: Genes, Genomes, Genetics; 222 papers in training set; Top 0.9%; 0.8%
25. Nature Communications; 4913 papers in training set; Top 63%; 0.7%
26. Biology; 43 papers in training set; Top 3%; 0.7%
27. Genome Biology; 555 papers in training set; Top 7%; 0.7%
28. Communications Biology; 886 papers in training set; Top 24%; 0.7%
29. Genomics; 60 papers in training set; Top 3%; 0.7%
30. Journal of Molecular Biology; 217 papers in training set; Top 4%; 0.7%
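
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below uses the first ten values from the list; the helper name `journals_to_reach` is hypothetical, and because the displayed percentages are rounded, the running total is only approximate.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-10, as listed above.
top_probabilities = [14.7, 14.3, 6.8, 4.2, 4.0, 3.6, 3.6, 3.6, 3.6, 2.9]


def journals_to_reach(probabilities, threshold):
    """Return the smallest rank at which the cumulative probability
    reaches the threshold, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return rank
    return None


rank_at_half = journals_to_reach(top_probabilities, 50.0)
print(rank_at_half)
```

The cumulative total stays below 50% through rank 6 (about 47.6%) and crosses it at rank 7 (about 51.2%), consistent with the cutoff shown in the list.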