A Systematic Benchmark for Peptide Property Prediction

Dong, X.; Yang, K.; Wu, T.; Li, P.; Gao, L.

2026-02-10 · bioinformatics
bioRxiv · doi:10.64898/2026.02.09.704773
Accurate prediction of peptide physicochemical properties and biological activities is critical for rational peptide design and high-throughput screening. However, current research is often constrained by heterogeneous data sources and inconsistent evaluation standards, which hinder fair comparisons and reliable assessment of model generalization. In this work, we present PPB, a peptide property prediction benchmark designed to evaluate model performance with an emphasis on realistic generalization across both classification and regression tasks. By applying unified biological filtering criteria, we systematically curated and standardized 15 datasets comprising 161,571 unique sequences, spanning a wide range of physicochemical properties and functional activities. We benchmarked seven representative architectures, encompassing traditional machine learning, deep learning, and pre-trained language models, alongside diverse feature encoding schemes. Furthermore, we investigated the impact of random versus homology-based (sequence-similarity) data splitting strategies on model robustness. To facilitate community access, we developed the PPB web server (http://ppb.molmatrix.com/index.html), which provides centralized resources for standardized dataset downloads, interactive visualization of benchmark results, and detailed evaluation protocols.

Author summary

Peptides are short amino acid chains essential for biological function and drug discovery. While AI models have accelerated peptide property prediction, the field lacks a unified standard for fairly comparing these methods, which often leads to inconsistent results and overoptimistic performance estimates. In this study, we introduce the Peptide Property Benchmark (PPB), a comprehensive framework featuring 15 standardized datasets and over 160,000 sequences. We systematically evaluated diverse AI paradigms, including traditional machine learning and advanced protein language models. Our results demonstrate that large-scale pre-trained models, the biological counterpart of large language models, offer superior accuracy and stability, particularly on small or complex datasets. Crucially, our analysis reveals a "clustering bottleneck": standard tools for grouping proteins by sequence similarity often fail when applied to short peptides, causing the data to fragment excessively. This suggests that traditional strategies for testing model generalization may be less effective for peptides than previously assumed. To support community progress, we provide an online platform for standardized data and evaluation. This work establishes a rigorous foundation for developing more reliable AI tools for the next generation of peptide-based therapeutics.
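The homology-based splitting contrasted with random splitting above can be sketched as a two-step procedure: cluster sequences by pairwise identity, then assign whole clusters to either the train or test partition so that no test peptide has a near-duplicate in training. The sketch below is illustrative only, not PPB's actual protocol; the identity measure, threshold, and greedy assignment are assumptions here (real pipelines typically use dedicated tools such as MMseqs2 or CD-HIT, whose behavior on very short peptides is exactly what the "clustering bottleneck" finding concerns).

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # Crude global identity via difflib; real pipelines use proper alignment.
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(seqs, threshold):
    """Greedy clustering: each sequence joins the first cluster whose
    representative it matches at or above `threshold`, else starts a new one."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

def homology_split(seqs, threshold=0.7, test_frac=0.2):
    """Assign whole clusters to the test set until roughly test_frac of the
    data is covered, so train and test share no similar sequences."""
    clusters = sorted(greedy_cluster(seqs, threshold), key=len)
    train, test = [], []
    target = test_frac * len(seqs)
    for c in clusters:
        (test if len(test) < target else train).extend(c)
    return train, test

# Toy peptides (hypothetical): two near-duplicate pairs and one outlier.
peptides = ["ACDEFGHIK", "ACDEFGHIR", "WWPGLNNAM", "WWPGLNNAL", "KKKKRRRR"]
train, test = homology_split(peptides, threshold=0.7, test_frac=0.4)
```

The key property, and the reason homology splits give more honest generalization estimates, is that near-duplicates always land in the same partition; a random split would routinely place one copy of a pair in train and the other in test.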

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Journal percentile | Predicted probability
1 | Journal of Chemical Information and Modeling | 207 | Top 0.1% | 42.0%
2 | Bioinformatics | 1061 | Top 3% | 7.2%
3 | Protein Science | 221 | Top 0.3% | 3.8%
4 | Bioinformatics Advances | 184 | Top 1% | 3.8%
5 | Computational and Structural Biotechnology Journal | 216 | Top 1% | 3.8%
6 | Journal of Cheminformatics | 25 | Top 0.2% | 2.8%
7 | Journal of Proteome Research | 215 | Top 0.9% | 2.5%
8 | Chemical Science | 71 | Top 0.6% | 2.2%
9 | PLOS Computational Biology | 1633 | Top 15% | 1.9%
10 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.4% | 1.8%
11 | BMC Bioinformatics | 383 | Top 4% | 1.8%
12 | PLOS ONE | 4510 | Top 56% | 1.6%
13 | Briefings in Bioinformatics | 326 | Top 4% | 1.6%
14 | Journal of Molecular Biology | 217 | Top 2% | 1.3%
15 | Journal of Chemical Theory and Computation | 126 | Top 0.6% | 1.3%
16 | mAbs | 28 | Top 0.2% | 1.3%
17 | Scientific Reports | 3102 | Top 68% | 1.0%
18 | Molecules | 37 | Top 1% | 0.9%
19 | Proceedings of the National Academy of Sciences | 2130 | Top 42% | 0.8%
20 | International Journal of Molecular Sciences | 453 | Top 13% | 0.8%
21 | Journal of the American Society for Mass Spectrometry | 33 | Top 0.5% | 0.8%
22 | ACS Pharmacology & Translational Science | 40 | Top 0.9% | 0.8%
23 | NAR Genomics and Bioinformatics | 214 | Top 4% | 0.8%
24 | Journal of Medicinal Chemistry | 68 | Top 1% | 0.7%
25 | Nature Machine Intelligence | 61 | Top 4% | 0.7%
26 | Artificial Intelligence in the Life Sciences | 11 | Top 0.4% | 0.5%
27 | Cell Systems | 167 | Top 14% | 0.5%
28 | Advanced Science | 249 | Top 22% | 0.5%
29 | Communications Chemistry | 39 | Top 2% | 0.5%