CROWN: Curated Repository Of Well-resolved Noncovalent interactions

Poelmans, R.; Van Eynde, W.; Bruncsics, B.; Bruncsics, B.; Arany, A.; Moreau, Y.; Voet, A. R.

2026-04-01 bioinformatics
10.64898/2026.03.30.714168 bioRxiv

Abstract: The development of machine learning models for protein-ligand interactions is fundamentally constrained by the quality and diversity of available structural data. Existing databases of protein-ligand complexes present researchers with an unsatisfying trade-off: carefully curated collections such as PDBBind and HiQBind offer high structural reliability but cover only a narrow slice of the Protein Data Bank (PDB), while large-scale resources like PLInder provide broad coverage at the expense of rigorous quality control. Here, we introduce CROWN (Curated Repository Of Well-resolved Non-covalent interactions), a machine learning-ready dataset that reconciles this tension by applying a comprehensive, fully automated preprocessing pipeline to the PLInder database. Starting from 649,915 protein-ligand interaction systems, CROWN applies a series of interleaved quality filters and processing stages addressing crystallographic resolution, ligand identity, pocket completeness, structural repair, interaction quality, and protonation at physiological pH. A distinguishing feature of the pipeline is a final constrained energy minimisation step using custom flat-bottomed restraints, which balances crystallographic evidence with relaxation of intramolecular strain. This step, absent from existing protein-ligand datasets, produces structurally uniform complexes by reconciling the heterogeneous refinement practices of different crystallographers and structure determination protocols, without distorting the experimentally observed binding geometry. The resulting dataset of 153,005 complexes represents a roughly four-fold increase in protein and species diversity over PDBBind and HiQBind, while maintaining rigorous structural standards.
Importantly, CROWN adopts a geometry-centric design philosophy that treats the 3D arrangement of atoms at the binding interface as a self-consistent source of information, rather than relying on externally measured binding affinities that cover only a fraction of known structures and introduce well-documented biases. We anticipate that CROWN will serve as a broadly useful resource for training generative models of protein-ligand binding poses, developing scoring functions, and benchmarking interaction prediction methods.
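
The abstract does not give the functional form or parameters of the flat-bottomed restraints, so the following is only a minimal sketch of the general idea, assuming a harmonic penalty outside a tolerance radius. The function name, the 0.5 Å tolerance, and the force constant are hypothetical illustration values, not taken from CROWN.

```python
import math


def flat_bottom_restraint_energy(position, reference,
                                 tolerance=0.5, force_constant=10.0):
    """Generic flat-bottomed harmonic restraint (illustrative only).

    position, reference: 3D coordinates (assumed to be in angstroms).
    tolerance: radius of the flat region; hypothetical value.
    force_constant: stiffness outside the flat region; hypothetical value.
    """
    # Distance of the atom from its crystallographic reference position.
    distance = math.dist(position, reference)
    # Inside the flat region the atom moves freely (zero penalty);
    # outside it, only the excess displacement is penalised.
    excess = max(0.0, distance - tolerance)
    return 0.5 * force_constant * excess ** 2


reference_xyz = (0.0, 0.0, 0.0)
inside = flat_bottom_restraint_energy((0.3, 0.0, 0.0), reference_xyz)
outside = flat_bottom_restraint_energy((1.5, 0.0, 0.0), reference_xyz)
```

With this form, small relaxations near the experimental coordinates cost nothing, while larger departures from the observed geometry are penalised quadratically, which is one way to balance crystallographic evidence against relief of strain.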

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 1%; 18.8%
2. Protein Science: 221 papers in training set; Top 0.1%; 10.1%
3. Structure: 175 papers in training set; Top 0.2%; 8.4%
4. Nature Communications: 4913 papers in training set; Top 26%; 6.8%
5. Nature Methods: 336 papers in training set; Top 2%; 6.3%

50% of probability mass above

6. Journal of Molecular Biology: 217 papers in training set; Top 0.5%; 4.0%
7. Bioinformatics Advances: 184 papers in training set; Top 1%; 3.6%
8. Acta Crystallographica Section D Structural Biology: 54 papers in training set; Top 0.1%; 3.6%
9. Cell Systems: 167 papers in training set; Top 5%; 2.7%
10. Journal of Cheminformatics: 25 papers in training set; Top 0.2%; 2.6%
11. Journal of Structural Biology: 58 papers in training set; Top 0.5%; 2.1%
12. Nucleic Acids Research: 1128 papers in training set; Top 9%; 2.1%
13. Journal of Chemical Information and Modeling: 207 papers in training set; Top 2%; 1.7%
14. Nature Biotechnology: 147 papers in training set; Top 4%; 1.7%
15. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 33%; 1.7%
16. PLOS ONE: 4510 papers in training set; Top 57%; 1.5%
17. Scientific Reports: 3102 papers in training set; Top 62%; 1.5%
18. Proteins: Structure, Function, and Bioinformatics: 82 papers in training set; Top 0.6%; 1.3%
19. Nature: 575 papers in training set; Top 13%; 1.1%
20. mAbs: 28 papers in training set; Top 0.3%; 0.8%
21. Communications Biology: 886 papers in training set; Top 21%; 0.8%
22. PLOS Computational Biology: 1633 papers in training set; Top 23%; 0.8%
23. Science: 429 papers in training set; Top 19%; 0.8%
24. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 8%; 0.8%
25. Cell Reports Methods: 141 papers in training set; Top 5%; 0.7%
26. eLife: 5422 papers in training set; Top 61%; 0.6%
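
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below assumes the final percentage in each entry is the predicted probability; the helper name `journals_to_reach` is hypothetical.

```python
# Predicted probabilities (percent) for the five highest-ranked journals,
# copied from the list above.
top_ranked = [
    ("Bioinformatics", 18.8),
    ("Protein Science", 10.1),
    ("Structure", 8.4),
    ("Nature Communications", 6.8),
    ("Nature Methods", 6.3),
]


def journals_to_reach(entries, threshold):
    """Return (count, cumulative) once the running total meets the threshold."""
    cumulative = 0.0
    for count, (_name, probability) in enumerate(entries, start=1):
        cumulative += probability
        if cumulative >= threshold:
            return count, cumulative
    return None  # threshold not reached with the given entries


result = journals_to_reach(top_ranked, 50.0)
```

Here the running total first reaches the 50% threshold at the fifth journal, consistent with where the page places its divider.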