
A novel method to select Reference Proteomes in UniProt

Raposo, P.; Martinez Marin, J. S.; Kim, G.; Insana, G.; Jyothi, D.; Luo, J.; Tunstall, T.; Consortium, U.; Orchard, S.; Steinegger, M.; Martin, M.

2026-05-14 bioinformatics
10.64898/2026.05.12.720148 bioRxiv

Motivation: The ongoing revolution in genome sequencing is delivering an unprecedented number of genome assemblies to global repositories, resulting in an overwhelming amount of data imported to UniProt in the form of proteomes. To manage this growth sustainably, there is a need for a systematic workflow to select the best proteomes.

Results: We propose a novel pipeline for cellular organisms to select the best Reference Proteomes, i.e. those that best represent the protein space of a species. The pipeline uses a clustering algorithm based on MMseqs2 to select the minimum number of Reference Proteomes whilst maximising the representation of the protein space for each species. Additionally, we aligned our viral Reference Proteomes with the exemplar genome set defined by the International Committee on Taxonomy of Viruses. Because this method ensures that every species is represented by at least one Reference Proteome, the number of Reference Proteomes in the UniProt Knowledgebase increased by 36%, covering 34% more species in the Tree of Life. The UniProt Knowledgebase will mainly retain proteins from Reference Proteomes, so this method also reduces the overall number of proteins by 43%, leading to a more concise yet representative knowledgebase.

Availability and Implementation: https://www.uniprot.org/proteomes

Contact: raposo@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
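The selection goal stated in the Results (as few Reference Proteomes as possible, covering as much of a species' protein space as possible) resembles a set-cover problem. The sketch below is a generic greedy heuristic for that kind of problem, not the pipeline described in the abstract. It assumes that clustering (for example with MMseqs2) has already been run upstream, and that each proteome has been reduced to the set of cluster IDs its proteins fall into. The function name `select_reference_proteomes`, the `target_coverage` parameter, and the proteome and cluster IDs are all hypothetical.

```python
from typing import Dict, List, Set


def select_reference_proteomes(
    proteome_clusters: Dict[str, Set[str]],
    target_coverage: float = 1.0,
) -> List[str]:
    """Greedy set cover over one species (illustrative only).

    proteome_clusters maps a proteome ID to the set of protein-cluster IDs
    it contains. Proteomes are picked one at a time, each time taking the
    one that adds the most clusters not yet covered, until the requested
    fraction of all clusters is covered.
    """
    all_clusters: Set[str] = set().union(*proteome_clusters.values())
    covered: Set[str] = set()
    selected: List[str] = []

    while all_clusters and len(covered) / len(all_clusters) < target_coverage:
        # Iterate in sorted order so ties go to the smallest proteome ID.
        candidates = [p for p in sorted(proteome_clusters) if p not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda p: len(proteome_clusters[p] - covered))
        if not proteome_clusters[best] - covered:
            break  # no remaining proteome adds anything new
        selected.append(best)
        covered |= proteome_clusters[best]

    return selected


# Hypothetical species with three candidate proteomes and four clusters.
example = {
    "UP_A": {"c1", "c2", "c3"},
    "UP_B": {"c3", "c4"},
    "UP_C": {"c1", "c2"},
}
print(select_reference_proteomes(example))        # full coverage
print(select_reference_proteomes(example, 0.75))  # stop at 75% coverage
```

Greedy selection is a standard approximation for set cover and does not guarantee the true minimum number of proteomes. A species whose proteomes contain no clusters would get an empty selection here, so the "at least one Reference Proteome per species" rule from the abstract would need separate handling.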

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 0.4% | 39.8%
2. BMC Bioinformatics | 383 papers in training set | Top 0.6% | 12.5%
50% of probability mass above
3. Nucleic Acids Research | 1128 papers in training set | Top 4% | 4.9%
4. GigaScience | 172 papers in training set | Top 0.4% | 4.0%
5. Wellcome Open Research | 57 papers in training set | Top 0.3% | 3.6%
6. Nature Communications | 4913 papers in training set | Top 45% | 2.6%
7. PLOS Computational Biology | 1633 papers in training set | Top 12% | 2.6%
8. Journal of Proteome Research | 215 papers in training set | Top 1.0% | 2.4%
9. Scientific Data | 174 papers in training set | Top 0.8% | 2.1%
10. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.1%
11. Virus Evolution | 140 papers in training set | Top 0.7% | 1.7%
12. Peer Community Journal | 254 papers in training set | Top 2% | 1.7%
13. Database | 51 papers in training set | Top 0.4% | 1.7%
14. Bioinformatics Advances | 184 papers in training set | Top 4% | 1.2%
15. Frontiers in Bioinformatics | 45 papers in training set | Top 0.5% | 1.0%
16. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 7% | 1.0%
17. Viruses | 318 papers in training set | Top 4% | 0.9%
18. PLOS ONE | 4510 papers in training set | Top 63% | 0.9%
19. BMC Biology | 248 papers in training set | Top 4% | 0.8%
20. Scientific Reports | 3102 papers in training set | Top 74% | 0.8%
21. Molecular Ecology Resources | 161 papers in training set | Top 1% | 0.7%
22. Journal of Medical Virology | 137 papers in training set | Top 5% | 0.7%
23. mBio | 750 papers in training set | Top 12% | 0.7%
24. International Journal of Molecular Sciences | 453 papers in training set | Top 17% | 0.7%
25. Microbial Genomics | 204 papers in training set | Top 2% | 0.7%
26. Microbiology Resource Announcements | 22 papers in training set | Top 1% | 0.7%