Structure-derived synthetic sequences guide a protein language model toward metalloproteins

Peteani, G.; Sgueglia, G.; Lemmin, T.; Chino, M.

2026-05-05 bioinformatics

10.64898/2026.04.30.722007 bioRxiv


Motivation: Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality.

Results: We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes.

Availability and implementation: Code, trained models, and datasets are available at https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.
