Structure-derived synthetic sequences guide a protein language model toward metalloproteins

Peteani, G.; Sgueglia, G.; Lemmin, T.; Chino, M.

2026-05-05 bioinformatics
10.64898/2026.04.30.722007 bioRxiv

Motivation: Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality.

Results: We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes.

Availability and implementation: Code, trained models, and datasets are available at https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.
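
To make the motif-recovery figure concrete, here is a minimal Python sketch of how such a rate could be computed: the fraction of generated sequences that contain at least one motif from a reference set. The motif patterns (`ExxH`, `CxxC`), the function name `motif_recovery`, and the toy sequences are illustrative assumptions; the abstract does not state which motifs or matching procedure the authors used.

```python
import re

# Hypothetical motif set for illustration only; the paper's actual
# motif definitions are not given in the abstract.
MOTIFS = {
    "ExxH": re.compile(r"E..H"),
    "CxxC": re.compile(r"C..C"),
}

def motif_recovery(sequences, motifs=MOTIFS):
    """Return the fraction of sequences containing at least one motif."""
    if not sequences:
        return 0.0
    hits = sum(
        1 for seq in sequences
        if any(pattern.search(seq) for pattern in motifs.values())
    )
    return hits / len(sequences)

# Toy sequences (made up): the first two contain a motif, the last two do not.
toy = ["MKCAACLE", "MAEELHK", "MKLVEELLKK", "GGSGGS"]
rate = motif_recovery(toy)
print(f"motif recovery: {rate:.0%}")
```

Running the same function on sequences sampled from a baseline model and from a fine-tuned model would give two rates that can be compared directly, which is the kind of comparison the 43% versus 91% figures describe.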

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Bioinformatics | 1061 | Top 0.3% | 43.6% |
| 2 | Protein Science | 221 | Top 0.1% | 8.8% |
| | (50% of probability mass above) | | | |
| 3 | Cell Systems | 167 | Top 2% | 6.6% |
| 4 | Nature Communications | 4913 | Top 32% | 5.1% |
| 5 | Bioinformatics Advances | 184 | Top 0.8% | 4.5% |
| 6 | PLOS Computational Biology | 1633 | Top 9% | 3.8% |
| 7 | Proceedings of the National Academy of Sciences | 2130 | Top 24% | 2.9% |
| 8 | Nature Methods | 336 | Top 4% | 2.0% |
| 9 | Scientific Reports | 3102 | Top 52% | 2.0% |
| 10 | BMC Bioinformatics | 383 | Top 4% | 1.8% |
| 11 | Computational and Structural Biotechnology Journal | 216 | Top 4% | 1.8% |
| 12 | Science | 429 | Top 14% | 1.7% |
| 13 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.6% |
| 14 | Briefings in Bioinformatics | 326 | Top 4% | 1.4% |
| 15 | Journal of Cheminformatics | 25 | Top 0.5% | 0.9% |
| 16 | Nature Biotechnology | 147 | Top 6% | 0.9% |
| 17 | Communications Biology | 886 | Top 22% | 0.8% |
| 18 | Nature Machine Intelligence | 61 | Top 3% | 0.8% |
| 19 | Journal of Molecular Biology | 217 | Top 3% | 0.8% |
| 20 | Advanced Science | 249 | Top 21% | 0.7% |
| 21 | Journal of Chemical Information and Modeling | 207 | Top 3% | 0.5% |
| 22 | Molecular Systems Biology | 142 | Top 2% | 0.5% |
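
The 50% cutoff noted above can be checked by accumulating the predicted probabilities in rank order. The short Python sketch below does this for the first five listed values; the helper name `journals_to_reach` is hypothetical and is not part of the page or the paper.

```python
def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative mass) once the running sum meets the threshold."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    # Threshold never reached: report everything that was summed.
    return len(probs), total

# Predicted probabilities (percent) for the five top-ranked journals listed above.
probs = [43.6, 8.8, 6.6, 5.1, 4.5]
count, mass = journals_to_reach(probs)
print(f"top {count} journals cover about {mass:.1f}% of the probability mass")
```

With these values the threshold is crossed at the second journal, since 43.6% + 8.8% is about 52.4%.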