GenPept-Curated-2025: A Benchmark Dataset for Antimicrobial Peptide Prediction with Homology-Controlled Partitioning

Pham, H. T.; Huynh, B.; Nguyen-Vo, T.-H.

2026-04-29 bioinformatics
10.64898/2026.04.25.720793 bioRxiv

Antimicrobial peptides (AMPs) are promising therapeutic candidates against rising antimicrobial resistance, yet progress in AMP prediction is hampered by the lack of benchmark datasets that address homology leakage, negative set reliability, and distributional diversity. Existing AMP databases, designed as biological repositories, do not enforce the controlled partitioning required for rigorous machine learning evaluation. We present GenPept-Curated-2025, a curated, class-balanced benchmark of 11,000 peptide sequences (5,500 AMP / 5,500 non-AMP) derived from Bacteria, Archaea, and Fungi, and sourced exclusively from GenPept/NCBI Protein. The dataset was constructed through a reproducible pipeline comprising taxonomic scoping, quality control, precursor handling, annotation-based labeling, and Identical Protein Groups (IPG)-based deduplication, with sequence length restricted to 10-200 aa. The AMP proportion varies substantially across length bins (14.2% in [10, 50] aa to 77.1% in [101, 150] aa), identifying length-dependent class imbalance as a distribution shift that benchmarking must account for. The dataset is openly released to support standardized, reproducible, and leakage-free evaluation of AMP prediction models.
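The length-dependent class imbalance described in the abstract can be checked with a per-bin tally of labels. The following is a minimal sketch, not the authors' pipeline: the function name `amp_fraction_by_length_bin`, the `(sequence, label)` record format, and the toy sequences are all hypothetical, and only the [10, 50] and [101, 150] bins are named in the abstract (the other two bins are assumed here so that the 10-200 aa range is covered).

```python
from collections import defaultdict

# Assumed bin edges: only [10, 50] and [101, 150] appear in the abstract;
# (51, 100) and (151, 200) are guesses that fill out the 10-200 aa range.
LENGTH_BINS = [(10, 50), (51, 100), (101, 150), (151, 200)]


def amp_fraction_by_length_bin(records, bins=LENGTH_BINS):
    """Return {bin: fraction of AMP-labelled sequences} for (sequence, label) pairs.

    Bins that receive no sequences are omitted from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_amp, n_total]
    for seq, label in records:
        for lo, hi in bins:
            if lo <= len(seq) <= hi:
                counts[(lo, hi)][0] += int(label == 1)
                counts[(lo, hi)][1] += 1
                break  # each sequence falls into at most one bin
    return {b: n_amp / n_total for b, (n_amp, n_total) in counts.items()}


# Toy records (made-up sequences, not taken from the dataset).
# Label 1 = AMP, label 0 = non-AMP.
toy_records = [
    ("A" * 12, 1),
    ("G" * 30, 0),
    ("K" * 45, 0),
    ("L" * 120, 1),
    ("R" * 140, 1),
    ("V" * 110, 0),
]

fractions = amp_fraction_by_length_bin(toy_records)
for length_bin, fraction in sorted(fractions.items()):
    print(length_bin, f"{fraction:.1%}")
```

Running the same tally separately on each partition of a benchmark would show whether the splits share a similar length-conditional class balance, which is the kind of distribution shift the abstract flags.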

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics; 326 papers in training set; Top 0.4%; 10.2%
2. Nucleic Acids Research; 1128 papers in training set; Top 2%; 9.2%
3. Nature Communications; 4913 papers in training set; Top 21%; 9.2%
4. Scientific Data; 174 papers in training set; Top 0.2%; 7.2%
5. Journal of Chemical Information and Modeling; 207 papers in training set; Top 1.0%; 4.9%
6. Nature Biotechnology; 147 papers in training set; Top 2%; 4.9%
7. Advanced Science; 249 papers in training set; Top 4%; 4.2%
8. Scientific Reports; 3102 papers in training set; Top 41%; 3.1%

50% of probability mass above

9. Bioinformatics Advances; 184 papers in training set; Top 2%; 3.1%
10. Bioinformatics; 1061 papers in training set; Top 6%; 3.1%
11. Nature Machine Intelligence; 61 papers in training set; Top 1%; 2.6%
12. Nature Methods; 336 papers in training set; Top 3%; 2.6%
13. Cell Systems; 167 papers in training set; Top 6%; 2.1%
14. PLOS ONE; 4510 papers in training set; Top 49%; 1.9%
15. PLOS Computational Biology; 1633 papers in training set; Top 14%; 1.9%
16. Chemical Science; 71 papers in training set; Top 0.9%; 1.8%
17. Communications Biology; 886 papers in training set; Top 8%; 1.7%
18. Genome Medicine; 154 papers in training set; Top 4%; 1.7%
19. mAbs; 28 papers in training set; Top 0.3%; 1.0%
20. International Journal of Molecular Sciences; 453 papers in training set; Top 11%; 1.0%
21. Journal of Cheminformatics; 25 papers in training set; Top 0.5%; 0.9%
22. Protein Science; 221 papers in training set; Top 1%; 0.9%
23. Cell Reports Methods; 141 papers in training set; Top 4%; 0.9%
24. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 8%; 0.8%
25. Frontiers in Immunology; 586 papers in training set; Top 7%; 0.8%
26. BMC Bioinformatics; 383 papers in training set; Top 7%; 0.8%
27. NAR Genomics and Bioinformatics; 214 papers in training set; Top 4%; 0.7%
28. Molecules; 37 papers in training set; Top 2%; 0.6%
29. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 7%; 0.5%
30. Science Advances; 1098 papers in training set; Top 35%; 0.5%
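
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. This is a small illustrative sketch; the helper name `journals_to_reach` is hypothetical, and the values are the rounded percentages for ranks 1 to 10 as listed above.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the list above.
top_probabilities = [10.2, 9.2, 9.2, 7.2, 4.9, 4.9, 4.2, 3.1, 3.1, 3.1]


def journals_to_reach(probabilities, threshold=50.0):
    """Smallest number of top-ranked entries whose summed probability reaches threshold."""
    for k, running_total in enumerate(accumulate(probabilities), start=1):
        if running_total >= threshold:
            return k
    return None  # threshold not reached within the supplied entries


k = journals_to_reach(top_probabilities)
print(k)
```

With these rounded values the running total is about 49.8% after seven journals and about 52.9% after eight, which agrees with the cutoff shown after rank 8.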