Evaluation of Protein Reference Database Reduction and Its Impact on Peptide-Centric Metaproteomics

Vande Moortele, T.; Van de Vyver, S.; Binke, B.-B.; Van Den Bossche, T.; Dawyndt, P.; Martens, L.; Mesuere, B.; Verschaffelt, P.

2026-02-25 bioinformatics
10.64898/2026.02.24.707692 bioRxiv

Introduction/Background: Recent large-scale restructurings of UniProtKB included removal of redundant entries, exclusion of taxonomically unclassified organisms, and a shift toward a more reference-proteome-centered approach. This raised concerns about the stability of peptide-centric metaproteomics workflows. In parallel, metagenomics-assisted "targeted" database restriction is often proposed to reduce ambiguity, but its net impact on peptide-centric interpretation remains unclear.
Methods: We assessed the impact of three complementary factors on the taxonomic profiling of metaproteomics analyses: (i) successive global UniProtKB reductions, (ii) metagenomics-derived targeted database restriction, and (iii) Unipept's internal taxon validation filter. Peptide lists from two public metaproteomics datasets (human gut and marine hatchery) were analysed with Unipept and compared across sequential UniProtKB configurations and custom SSU/LSU-derived filtered databases.
Results: Across both environments, progressive UniProtKB downsizing reduced peptide coverage, did not fundamentally alter the most abundant taxa, and substantially lowered ambiguous root-level assignments. This suggests that the reduction in ambiguity stemmed from decreased redundancy rather than from a loss of meaningful biological information. Metagenomics-assisted targeted filtering introduced a clear trade-off: it markedly reduced peptide matches while producing only modest changes in resolution at lower taxonomic ranks. However, it consistently reduced non-specific root-level assignments. The effects on taxon discoverability and relative abundances were heavily dependent on the environment, with stronger shifts observed in the less well-represented marine dataset. Finally, the added benefit of Unipept's internal taxon validation filter decreased across newer, more curated database configurations. It had the largest impact on older, more inclusive releases and became minimal under the reference-proteome-focused setup.
Discussion/Conclusion: Overall, UniProtKB restructuring does not destabilize peptide-centric metaproteomic analyses. Instead, it tends to reduce ambiguity while preserving high-level community structure. Targeted database restriction offers a trade-off between sensitivity and reduced ambiguity in a strongly context-dependent manner. As UniProtKB becomes increasingly curated and reference-proteome-centered, the need for additional internal taxonomic filtering in Unipept appears to diminish.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1
Journal of Proteome Research
215 papers in training set
Top 0.1%
22.7%
2
BMC Bioinformatics
383 papers in training set
Top 0.6%
12.6%
3
PeerJ
261 papers in training set
Top 0.7%
6.4%
4
Peer Community Journal
254 papers in training set
Top 0.5%
4.9%
5
PLOS ONE
4510 papers in training set
Top 34%
4.3%
50% of probability mass above
6
PROTEOMICS
35 papers in training set
Top 0.2%
3.6%
7
PLOS Computational Biology
1633 papers in training set
Top 10%
3.6%
8
Bioinformatics
1061 papers in training set
Top 6%
2.7%
9
Scientific Reports
3102 papers in training set
Top 45%
2.6%
10
Molecular & Cellular Proteomics
158 papers in training set
Top 0.8%
2.6%
11
BMC Genomics
328 papers in training set
Top 2%
2.1%
12
mSystems
361 papers in training set
Top 4%
2.1%
13
Journal of Proteomics
27 papers in training set
Top 0.1%
1.9%
14
Computational and Structural Biotechnology Journal
216 papers in training set
Top 4%
1.9%
15
GigaScience
172 papers in training set
Top 1%
1.7%
16
Bioinformatics Advances
184 papers in training set
Top 3%
1.5%
17
Frontiers in Microbiology
375 papers in training set
Top 7%
1.2%
18
Microbiology Spectrum
435 papers in training set
Top 4%
1.2%
19
Briefings in Bioinformatics
326 papers in training set
Top 5%
1.1%
20
NAR Genomics and Bioinformatics
214 papers in training set
Top 3%
1.1%
21
Database
51 papers in training set
Top 0.7%
1.0%
22
Metabarcoding and Metagenomics
12 papers in training set
Top 0.1%
1.0%
23
Biology
43 papers in training set
Top 2%
0.9%
24
Nature Communications
4913 papers in training set
Top 62%
0.8%
25
Analytical Chemistry
205 papers in training set
Top 2%
0.8%
26
Methods in Ecology and Evolution
160 papers in training set
Top 2%
0.7%
27
Frontiers in Veterinary Science
30 papers in training set
Top 1%
0.5%
28
Ecological Informatics
29 papers in training set
Top 1.0%
0.5%
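
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked directly from the listed percentages. The following is a minimal sketch of that arithmetic; the function name `rank_cutoff` is hypothetical and not part of the site, and only the first ten probabilities from the ranking are used.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the ranking above.
probabilities = [22.7, 12.6, 6.4, 4.9, 4.3, 3.6, 3.6, 2.7, 2.6, 2.6]


def rank_cutoff(probs, threshold=50.0):
    """Return (rank, cumulative %) for the first rank whose running total
    reaches the threshold, or None if the threshold is never reached.

    Hypothetical helper for illustration only.
    """
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None


rank, total = rank_cutoff(probabilities)
print(f"rank {rank}, cumulative {total:.1f}%")  # rank 5, cumulative 50.9%
```

The running total is 46.6% after four journals and 50.9% after five, which is consistent with the cutoff marker placed after PLOS ONE.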