Structural bias in machine learning-guided peptide design

Aldas-Bulos, V. D.; Plisson, F.

2026-05-08 bioinformatics
10.64898/2026.05.06.721805 bioRxiv

Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequence datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RoseTTAFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequence data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Journal of Chemical Information and Modeling | 207 | Top 0.1% | 28.1%
2 | PLOS Computational Biology | 1633 | Top 5% | 6.9%
3 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.1% | 6.4%
4 | Journal of Cheminformatics | 25 | Top 0.1% | 4.9%
5 | Computational and Structural Biotechnology Journal | 216 | Top 2% | 3.6%
(50% of probability mass above)
6 | Bioinformatics | 1061 | Top 5% | 3.6%
7 | Bioinformatics Advances | 184 | Top 1% | 3.6%
8 | Scientific Reports | 3102 | Top 40% | 3.1%
9 | Protein Science | 221 | Top 0.5% | 2.8%
10 | Journal of Chemical Theory and Computation | 126 | Top 0.4% | 2.8%
11 | Briefings in Bioinformatics | 326 | Top 3% | 2.1%
12 | PLOS ONE | 4510 | Top 49% | 1.9%
13 | Nature Communications | 4913 | Top 52% | 1.7%
14 | Cell Systems | 167 | Top 8% | 1.5%
15 | ImmunoInformatics | 11 | Top 0.1% | 1.4%
16 | The Journal of Physical Chemistry B | 158 | Top 1% | 1.4%
17 | Chemical Science | 71 | Top 1% | 1.4%
18 | BMC Bioinformatics | 383 | Top 5% | 1.4%
19 | Frontiers in Immunology | 586 | Top 5% | 1.2%
20 | Nature Machine Intelligence | 61 | Top 3% | 1.2%
21 | International Journal of Molecular Sciences | 453 | Top 11% | 1.0%
22 | NAR Genomics and Bioinformatics | 214 | Top 3% | 0.9%
23 | Frontiers in Bioinformatics | 45 | Top 0.8% | 0.8%
24 | Computers in Biology and Medicine | 120 | Top 4% | 0.8%
25 | mAbs | 28 | Top 0.3% | 0.8%
26 | Molecules | 37 | Top 2% | 0.8%
27 | Biophysical Journal | 545 | Top 5% | 0.7%
28 | Frontiers in Molecular Biosciences | 100 | Top 6% | 0.7%
29 | BMC Genomics | 328 | Top 7% | 0.7%
30 | Frontiers in Genetics | 197 | Top 12% | 0.5%