PocketBagger: Generalizable pocket druggability prediction via positive-unlabeled learning

Gingrich, P. W.; Biswas, A.; Mica, I. L.; Brammer, K. M.; Shu, Z.; Maxwell, D. S.; Russell, K. P.; Al-Lazikani, B.

2026-05-19 bioinformatics
10.64898/2026.05.15.725505 bioRxiv

Abstract

Summary: Reliable structure-based prediction of small-molecule druggability is hindered by a fundamental labeling problem. Experimentally confirmed liganded sites (positives) are observable, but credible "undruggable" pockets (negatives) are almost impossible to define. Standard supervised machine learning consequently relies on arbitrary definitions of undruggable, leading to bias and false negatives. Here we introduce PocketBagger, a positive-unlabeled (PU) learning framework for pocket druggability prediction trained exclusively on experimentally determined Protein Data Bank¹ (PDB) structures. PocketBagger uses PU bagging to learn key features associated with reliable druggable pockets and considers all remaining pockets in the structurally characterized proteome as unlabeled. We demonstrate the capability of PocketBagger through the training of a simple Random Forest classifier and demonstrate its power in recall (0.804), even when challenged with increasingly difficult generalizability assessments and entire protein-family hold outs. We benchmark and demonstrate the added value of PU learning by comparing PocketBagger to a leading deep-learning predictor. However, PocketBagger is intended to be used as a framework for any model architecture. Along with the code, the data generated by PocketBagger are deployed in canSAR.ai, providing scalable, generalizable pocket druggability predictions to the drug discovery community.
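To make the idea of PU bagging concrete, here is a minimal, generic sketch in pure Python. It is not the PocketBagger code. The abstract names a Random Forest as the base classifier, but this sketch swaps in a much simpler nearest-centroid scorer so that it runs with the standard library alone. All function names, the toy two-feature "pockets", the number of rounds, and the choice to draw as many pseudo-negatives as there are positives are assumptions made for illustration only.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score(c_pos, c_neg, x):
    """Score in [0, 1]; higher means x is closer to the positive centroid."""
    d_pos, d_neg = sq_dist(x, c_pos), sq_dist(x, c_neg)
    if d_pos + d_neg == 0:
        return 0.5
    return d_neg / (d_pos + d_neg)


def pu_bagging(positives, unlabeled, n_rounds=200, seed=0):
    """Average out-of-bag scores for unlabeled examples.

    Each round treats a bootstrap sample of the unlabeled set as
    pseudo-negatives, fits the stand-in base learner on positives versus
    that sample, and scores only the unlabeled examples left out of the bag.
    """
    rng = random.Random(seed)
    k = len(positives)  # assumption: one pseudo-negative per positive
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    c_pos = centroid(positives)
    for _ in range(n_rounds):
        # Bootstrap (sampling with replacement) from the unlabeled pool.
        bag = [rng.randrange(len(unlabeled)) for _ in range(k)]
        in_bag = set(bag)
        c_neg = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i not in in_bag:
                totals[i] += score(c_pos, c_neg, x)
                counts[i] += 1
    # None marks an example that was never out of bag (unlikely here).
    return [t / c if c else None for t, c in zip(totals, counts)]


# Made-up toy data: two features per "pocket".
positives = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]]
unlabeled = [
    [0.9, 1.0], [1.1, 0.8],                    # resemble the positives
    [-1.0, -1.0], [-1.2, -0.8], [-0.9, -1.1],  # do not
    [-1.1, -1.0], [-0.8, -1.2], [-1.0, -0.9],
]

scores = pu_bagging(positives, unlabeled)
for features, s in zip(unlabeled, scores):
    print(features, round(s, 3))
```

In this toy setting, the two unlabeled points that resemble the positives end up with higher averaged scores than the rest, even though no example was ever labeled as a negative. Swapping the centroid scorer for another classifier only changes the fit-and-score step inside the loop.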

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Bioinformatics | 1061 | Top 0.8% | 28.3%
2 | Cell Systems | 167 | Top 1.0% | 10.3%
3 | Proceedings of the National Academy of Sciences | 2130 | Top 7% | 9.3%
4 | Nature Communications | 4913 | Top 35% | 4.4%
(50% of probability mass above)
5 | Nature Machine Intelligence | 61 | Top 0.9% | 3.7%
6 | Journal of Cheminformatics | 25 | Top 0.1% | 3.7%
7 | PLOS Computational Biology | 1633 | Top 9% | 3.7%
8 | Bioinformatics Advances | 184 | Top 2% | 3.1%
9 | Protein Science | 221 | Top 0.5% | 2.8%
10 | Nature Methods | 336 | Top 3% | 2.7%
11 | Scientific Reports | 3102 | Top 49% | 2.1%
12 | Journal of Chemical Information and Modeling | 207 | Top 2% | 2.1%
13 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.5% | 1.7%
14 | BMC Bioinformatics | 383 | Top 5% | 1.5%
15 | Briefings in Bioinformatics | 326 | Top 5% | 1.4%
16 | Journal of Molecular Biology | 217 | Top 2% | 1.3%
17 | Nature Biotechnology | 147 | Top 6% | 1.1%
18 | PLOS ONE | 4510 | Top 63% | 0.9%
19 | Communications Biology | 886 | Top 18% | 0.9%
20 | Nature | 575 | Top 14% | 0.9%
21 | Science | 429 | Top 19% | 0.8%
22 | NAR Genomics and Bioinformatics | 214 | Top 4% | 0.8%
23 | Patterns | 70 | Top 3% | 0.7%
24 | mAbs | 28 | Top 0.4% | 0.7%
25 | Journal of Proteome Research | 215 | Top 2% | 0.7%
26 | Protein Engineering, Design and Selection | 14 | Top 0.1% | 0.7%