Large-scale analysis of ligand binding mode similarities in the PDB using interaction fingerprints

Kunnakkattu, I. R.; Choudhary, P.; Midlik, A.; Fleming, J. R.; Balasubramaniyan, B.; Sasidharan Nair, S.; Velankar, S.

2026-04-21 bioinformatics
10.64898/2026.04.17.719144 bioRxiv

Three-dimensional structures of protein-ligand complexes are essential for understanding the molecular principles that govern ligand recognition and binding. With more than 180,000 ligand-bound entries in the Protein Data Bank (PDB), representing over two million individual complexes, the available structural data offer unprecedented opportunities for large-scale analysis of interaction patterns. Such analysis across the PDB archive can reveal similarities and differences in the binding modes of ligands and thereby assist drug discovery. However, large-scale analysis of up-to-date information remains a significant challenge because the data grow rapidly. Here, we introduce the Extended Connectivity Interaction Fingerprint (ECIFP), an interaction-based fingerprint that condenses 3D protein-ligand contact information into a compact representation while retaining key molecular and chemical features of the interacting fragments. This simpler representation of the interaction data makes comparison of millions of protein-ligand complexes tractable. Benchmarking shows that ECIFP outperforms ligand-only Extended Connectivity Fingerprints in identifying similar binding sites across identical protein sequences occupied by chemically diverse ligands. Our analysis shows that similarities calculated using ECIFP can be used to compare macromolecular complexes with similar or different ligands. We demonstrate two large-scale applications of ECIFP: (1) identification of distinct binding modes for over 9,000 ligands across the entire PDB, and (2) detection of binding-mode similarities among structurally diverse ligands within the same binding site across 48,870 binding sites from over 21,000 proteins.
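
As a rough illustration of why a fingerprint representation makes pairwise comparison cheap, the sketch below compares toy fingerprints stored as sets of on-bit indices. The abstract does not say which similarity measure ECIFP uses, so the Tanimoto coefficient here is an assumption (it is a common choice for molecular fingerprints), and the bit indices and the names `tanimoto`, `complex_a`, `complex_b`, and `complex_c` are hypothetical, not taken from the paper.

```python
def tanimoto(fp_a: frozenset, fp_b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical toy fingerprints; real ECIFP bits are NOT derived here.
complex_a = frozenset({3, 17, 42, 101, 256})
complex_b = frozenset({3, 17, 42, 300})
complex_c = frozenset({5, 9})

sim_ab = tanimoto(complex_a, complex_b)  # 3 shared bits out of 6 distinct bits
sim_ac = tanimoto(complex_a, complex_c)  # no shared bits

print(f"a vs b: {sim_ab:.2f}")
print(f"a vs c: {sim_ac:.2f}")
```

Each comparison reduces to set intersection and a few integer operations, which is what keeps millions of pairwise comparisons tractable in principle.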

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 2% | 14.6%
2 | Journal of Chemical Information and Modeling | 207 papers in training set | Top 0.4% | 14.2%
3 | Journal of Cheminformatics | 25 papers in training set | Top 0.1% | 7.1%
4 | Nature Communications | 4913 papers in training set | Top 33% | 4.8%
5 | Scientific Reports | 3102 papers in training set | Top 28% | 4.3%
6 | Nucleic Acids Research | 1128 papers in training set | Top 6% | 3.5%
7 | Journal of Molecular Biology | 217 papers in training set | Top 0.6% | 3.5%
50% of probability mass above
8 | Protein Science | 221 papers in training set | Top 0.5% | 3.0%
9 | Nature Methods | 336 papers in training set | Top 3% | 3.0%
10 | PLOS Computational Biology | 1633 papers in training set | Top 12% | 2.7%
11 | Communications Biology | 886 papers in training set | Top 4% | 2.3%
12 | Bioinformatics Advances | 184 papers in training set | Top 2% | 2.1%
13 | Briefings in Bioinformatics | 326 papers in training set | Top 3% | 1.9%
14 | Advanced Science | 249 papers in training set | Top 10% | 1.8%
15 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 4% | 1.8%
16 | Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 0.4% | 1.8%
17 | Cell Systems | 167 papers in training set | Top 7% | 1.7%
18 | Communications Chemistry | 39 papers in training set | Top 0.3% | 1.7%
19 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.7%
20 | PLOS ONE | 4510 papers in training set | Top 57% | 1.5%
21 | eLife | 5422 papers in training set | Top 45% | 1.5%
22 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 37% | 1.3%
23 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 1.2%
24 | Cell Reports Methods | 141 papers in training set | Top 3% | 1.2%
25 | BMC Bioinformatics | 383 papers in training set | Top 7% | 0.8%
26 | The Journal of Physical Chemistry B | 158 papers in training set | Top 2% | 0.7%
27 | Structure | 175 papers in training set | Top 3% | 0.7%
28 | Genome Medicine | 154 papers in training set | Top 9% | 0.7%
29 | Chemical Science | 71 papers in training set | Top 2% | 0.6%
30 | Nature Biotechnology | 147 papers in training set | Top 9% | 0.6%
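
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below is only an arithmetic check on the first ten values as shown in the ranking; the function name `journals_to_reach` is hypothetical, and nothing here reflects how the predictions themselves were produced.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-10, copied from the ranking.
probabilities = [14.6, 14.2, 7.1, 4.8, 4.3, 3.5, 3.5, 3.0, 3.0, 2.7]


def journals_to_reach(threshold, probs):
    """Return the smallest rank whose cumulative probability reaches threshold."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None  # threshold not reached within the supplied values


cutoff_rank = journals_to_reach(50.0, probabilities)
print(cutoff_rank)
```

The cumulative total is about 48.5% after rank 6 and about 52.0% after rank 7, which matches the position of the "50% of probability mass above" marker.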