Seq2Pocket: Augmenting protein language models for spatially consistent binding site prediction

Skrhak, V.; Polak, L.; Novotny, M.; Hoksza, D.

2026-01-31 bioinformatics
10.64898/2026.01.28.702257 bioRxiv

Protein-ligand binding site prediction (LBS) is important for many domains, including computational drug discovery, where, as in other tasks, protein language models (pLMs) have shown great promise. When applied to LBS, the pLM classifies each amino acid as binding or non-binding. For downstream analysis, these predictions are then mapped onto the structure to form structurally continuous pockets. However, the residue-oriented nature of the predictions often results in spatially fragmented pockets. We present a comprehensive framework, Seq2Pocket, that addresses this by combining a finetuned pLM with an embedding-supported smoothing classifier and an optimized clustering strategy. Finetuning on our enhanced scPDB dataset yields state-of-the-art results, outperforming existing predictors by up to 11% in DCC recall, while the smoothing classifier restores pocket continuity. We also introduce the Pocket Fragmentation Index (PFI) and use it to select a clustering approach that preserves a consistent mapping between predictions and ground-truth pockets. Validated on the LIGYSIS and CryptoBench benchmarks, our approach ensures that pLM-based predictions are not only statistically accurate but also useful for downstream drug discovery, while maintaining state-of-the-art performance.
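
To make the step from per-residue predictions to pockets concrete, here is a minimal sketch that groups residues predicted as binding into pockets by spatial proximity. It is an illustration only and not the Seq2Pocket method: the function names, the 0.5 probability cutoff, the 6.0 Å distance cutoff, and the use of one representative coordinate per residue are all assumptions, and the abstract's smoothing classifier, optimized clustering strategy, and PFI are not reproduced here.

```python
import math

def cluster_binding_residues(coords, probs, prob_cutoff=0.5, dist_cutoff=6.0):
    """Group residues predicted as binding into pockets (hypothetical helper).

    coords: one (x, y, z) tuple per residue, e.g. a C-alpha position (assumption).
    probs:  per-residue binding probabilities from some classifier.
    Both cutoffs are illustrative values, not taken from the paper.
    """
    # Keep only residues the classifier labels as binding.
    unvisited = {i for i, p in enumerate(probs) if p >= prob_cutoff}
    pockets = []
    while unvisited:
        # Start a new pocket from the lowest-index unassigned residue.
        seed = min(unvisited)
        unvisited.remove(seed)
        pocket, frontier = [seed], [seed]
        # Grow the pocket: any unassigned residue within dist_cutoff of a
        # member joins it (single-linkage clustering).
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(coords[current], coords[j]) <= dist_cutoff]
            for j in near:
                unvisited.remove(j)
                pocket.append(j)
                frontier.append(j)
        pockets.append(sorted(pocket))
    return pockets

def pocket_center(coords, pocket):
    """Arithmetic mean of the member residue coordinates (hypothetical helper)."""
    n = len(pocket)
    return tuple(sum(coords[i][k] for i in pocket) / n for k in range(3))
```

With such a scheme, a single misclassified residue in the middle of a true pocket can split it into two clusters, which is the kind of spatial fragmentation the abstract describes. A pocket center computed this way could also be compared against a ligand center for a distance-based success criterion, although the exact DCC definition used in the paper is not given in the abstract.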

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 0.7% | 32.6%
2. Bioinformatics Advances | 184 papers in training set | Top 0.5% | 6.3%
3. Journal of Cheminformatics | 25 papers in training set | Top 0.1% | 6.2%
4. PLOS Computational Biology | 1633 papers in training set | Top 7% | 4.8%
5. Cell Systems | 167 papers in training set | Top 3% | 4.8%
50% of probability mass above
6. Nature Methods | 336 papers in training set | Top 2% | 4.3%
7. Nucleic Acids Research | 1128 papers in training set | Top 5% | 3.9%
8. Nature Communications | 4913 papers in training set | Top 38% | 3.8%
9. Journal of Chemical Information and Modeling | 207 papers in training set | Top 1% | 3.5%
10. Nature Biotechnology | 147 papers in training set | Top 4% | 2.1%
11. Scientific Reports | 3102 papers in training set | Top 56% | 1.8%
12. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.7%
13. Journal of Molecular Biology | 217 papers in training set | Top 2% | 1.7%
14. PLOS ONE | 4510 papers in training set | Top 55% | 1.6%
15. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 35% | 1.5%
16. Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.5%
17. Protein Science | 221 papers in training set | Top 1% | 1.2%
18. Communications Biology | 886 papers in training set | Top 17% | 0.9%
19. eLife | 5422 papers in training set | Top 52% | 0.9%
20. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 8% | 0.9%
21. Genome Medicine | 154 papers in training set | Top 7% | 0.8%
22. Advanced Science | 249 papers in training set | Top 19% | 0.7%
23. iScience | 1063 papers in training set | Top 33% | 0.7%
24. Cell Reports Methods | 141 papers in training set | Top 5% | 0.7%
25. NAR Genomics and Bioinformatics | 214 papers in training set | Top 4% | 0.7%
26. Chemical Science | 71 papers in training set | Top 2% | 0.6%
27. Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 1% | 0.6%