Improving Biological Sequence Prediction with AlphaFold2 Representation

Jiang, Z.; Nguyen, C. H.; Mamitsuka, H.

2026-04-28 bioinformatics
10.64898/2026.04.26.720550 bioRxiv
Motivation: Accurate prediction of functional sites from primary sequences is essential for elucidating biological mechanisms and advancing rational drug design. However, traditional sequence-based features cannot capture the structural context of a protein. AlphaFold2 (AF2) has revolutionized protein structure prediction, raising the expectation that it can also serve as a feature extractor whose structure-rich representation benefits sequence-based prediction, particularly for unknown sequences.

Results: We present a novel feature-engineering paradigm that uses a high-dimensional latent representation matrix (of size L x D, where L is the sequence length and D is the feature dimension) extracted directly from the AF2 Evoformer module. We systematically evaluated the AF2 representation against conventional sequence-based features, such as hidden Markov model profiles, using a variety of machine learning models on two structurally contrasting tasks: calpain cleavage site prediction and nucleic-acid-binding site prediction. The AF2 representation clearly and consistently outperformed conventional sequence-based features, particularly for targets with low sequence homology to the training data. Furthermore, interpretability analyses using SHapley Additive exPlanations (SHAP) and Uniform Manifold Approximation and Projection (UMAP) shed further light on this performance advantage through feature importance ranking and visualization. Overall, these empirical results confirm that the AF2 representation can effectively bridge the sequence-to-structure gap as a feature input for sequence prediction without imposing a heavy additional computational burden.

Availability and implementation: Source code, pre-trained models, and datasets are freely available to non-commercial users at https://github.com/Lili-irtyd/Improve-biological-sequences-prediction-by-AlphaFold2.

Contact: mami@kuicr.kyoto-u.ac.jp
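To make the L x D representation concrete, the sketch below shows one plausible way a per-residue matrix could be turned into a training set for site prediction. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how residues are featurized, so the sliding window, the zero padding at the termini, and the names `window_features` and `build_dataset` are all hypothetical, and the toy matrix stands in for a real AF2-derived representation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def window_features(rep: Sequence[Sequence[float]], half_width: int) -> Matrix:
    """Turn an L x D matrix into L rows of length (2 * half_width + 1) * D.

    ASSUMPTION: a zero-padded sliding window around each residue. The source
    does not specify any windowing scheme.
    """
    length = len(rep)
    dim = len(rep[0]) if length else 0
    zero = [0.0] * dim  # padding row for positions beyond either terminus
    rows: Matrix = []
    for i in range(length):
        row: List[float] = []
        for j in range(i - half_width, i + half_width + 1):
            row.extend(rep[j] if 0 <= j < length else zero)
        rows.append(row)
    return rows


def build_dataset(
    rep: Sequence[Sequence[float]],
    labels: Sequence[int],
    half_width: int = 1,
) -> Tuple[Matrix, List[int]]:
    """Pair each residue's windowed feature vector with its 0/1 site label."""
    if len(rep) != len(labels):
        raise ValueError("expected one label per residue")
    return window_features(rep, half_width), list(labels)


# Toy stand-in for an AF2-derived representation: L = 4 residues, D = 2.
toy_rep = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_labels = [0, 1, 0, 0]  # hypothetical labels: residue 2 is a functional site

X, y = build_dataset(toy_rep, toy_labels, half_width=1)
```

The resulting `X` and `y` could be passed to any standard classifier. Setting `half_width=0` reduces the scheme to one D-dimensional vector per residue, with no neighborhood context.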

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.