Protein Language Model Decoys for Target Decoy Competition in Proteomics: Quality Assessment and Benchmarks

Reznikov, G.; Kusters, F.; Mohammadi, M.; van den Toorn, H. W. P.; Sinitcyn, P.

2026-03-31 bioinformatics
10.64898/2026.03.27.714819 bioRxiv
Large-scale proteomics relies heavily on target-decoy competition for false discovery rate estimation in peptide identification, and the performance of this strategy depends strongly on the design of the decoy database. Classical generators such as reversal and shuffling remain widely used. Here, we introduce protein language model (PLM)-based decoy generation for peptide identification and benchmark it against classical strategies. We evaluate these approaches using three complementary quality-control layers: sequence-based separability, search-engine-agnostic spectral-space diagnostics, and end-to-end mass spectrometry benchmarks, including pipelines with rescoring. Across these analyses, PLM-based decoys are harder for sequence-only neural networks to distinguish than most classical generators, suggesting fewer obvious sequence-level artifacts. However, this sequence-level separability is only weakly informative for search performance. Spectral diagnostics further show that short peptides occupy a particularly crowded target-decoy space and are therefore especially prone to local collisions across all generators. In full search pipelines, reverse decoys remain a strong baseline, and current PLM-based generators do not yet provide a clear overall advantage. We therefore view PLM-based decoys not as universal replacements for reverse decoys, but as tunable tools for benchmarking, diagnostics, stress testing, and future adaptive decoy optimization, with increasing value as search models become more expressive.
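
For readers unfamiliar with the classical baselines named in the abstract, the following is a minimal Python sketch of reversal and shuffling decoys together with a simple target-decoy FDR estimate. It is not the authors' implementation: the function names, the convention of keeping the C-terminal residue fixed, and the decoys/targets estimator are illustrative assumptions, and real pipelines differ in details such as protein-level versus peptide-level reversal.

```python
import random


def reverse_decoy(peptide: str) -> str:
    """Reverse a peptide while keeping its last residue in place.

    Assumption: the C-terminal residue is held fixed so that tryptic
    K/R termini are preserved; this is one common convention, not the only one.
    """
    if len(peptide) < 2:
        return peptide
    return peptide[:-1][::-1] + peptide[-1]


def shuffle_decoy(peptide: str, seed: int = 0) -> str:
    """Shuffle all residues except the last one, using a seeded RNG."""
    if len(peptide) < 2:
        return peptide
    rng = random.Random(seed)
    body = list(peptide[:-1])
    rng.shuffle(body)
    return "".join(body) + peptide[-1]


def tdc_fdr(psms, threshold: float) -> float:
    """Estimate FDR at a score threshold as (#decoys / #targets).

    `psms` is an iterable of (score, is_decoy) pairs, one per spectrum,
    where each spectrum has already been assigned its best-scoring match
    from the concatenated target+decoy database (the "competition" step).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Hypothetical example data: (score, is_decoy)
example_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (3.0, True)]
example_fdr = tdc_fdr(example_psms, threshold=7.0)
```

In this sketch, any decoy generator (classical or PLM-based) would plug in at the database-construction step, while the FDR estimate itself is unchanged; that separation is why decoy design can be benchmarked independently of the estimator.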

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set, Top 5%, 19.2%
2. Molecular & Cellular Proteomics: 158 papers in training set, Top 0.1%, 14.7%
3. Journal of Proteome Research: 215 papers in training set, Top 0.3%, 10.4%
4. Cell Systems: 167 papers in training set, Top 2%, 7.4%

50% of probability mass above

5. Nature Methods: 336 papers in training set, Top 2%, 6.5%
6. Nature Machine Intelligence: 61 papers in training set, Top 0.4%, 6.5%
7. Bioinformatics: 1061 papers in training set, Top 5%, 3.7%
8. PLOS ONE: 4510 papers in training set, Top 42%, 3.2%
9. PLOS Computational Biology: 1633 papers in training set, Top 13%, 2.1%
10. eLife: 5422 papers in training set, Top 40%, 1.7%
11. Analytical Chemistry: 205 papers in training set, Top 2%, 1.4%
12. Molecular Systems Biology: 142 papers in training set, Top 1%, 1.0%
13. Peer Community Journal: 254 papers in training set, Top 3%, 1.0%
14. Genome Biology: 555 papers in training set, Top 6%, 1.0%
15. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 39%, 1.0%
16. Nature Biotechnology: 147 papers in training set, Top 6%, 1.0%
17. Communications Biology: 886 papers in training set, Top 18%, 0.9%
18. PROTEOMICS: 35 papers in training set, Top 0.7%, 0.8%
19. ACS Nano: 99 papers in training set, Top 3%, 0.8%
20. Nature Chemical Biology: 104 papers in training set, Top 3%, 0.8%
21. Nucleic Acids Research: 1128 papers in training set, Top 18%, 0.7%
22. Advanced Science: 249 papers in training set, Top 21%, 0.7%
23. Cell Reports Methods: 141 papers in training set, Top 7%, 0.5%
24. Patterns: 70 papers in training set, Top 3%, 0.5%