Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data

Vicente, A.; Dornfeld, L.; Coines, J.; Ferruz, N.

2026-02-09 bioinformatics
10.64898/2026.02.06.704305 bioRxiv

Proteins can bind small molecules with high specificity. However, designing proteins that bind user-defined ligands remains a challenge, typically relying on structural information and costly experimental iteration. While protein language models (pLMs) have shown promise for unconditional generation and conditioning on coarse functional labels, instance-level conditioning on a specific ligand has not been evaluated using purely textual inputs. Here we frame small-molecule protein binder design as a sequence-to-sequence translation problem and train ligand-conditioned pLMs that map molecular strings to candidate binder sequences. We curate large-scale ligand-protein datasets (>17M ligand-protein pairs) covering different data regimes and train a suite of models, spanning 16 to 700M parameters. Results reveal a consistent trade-off driven by supervision ambiguity: when each ligand is paired with few proteins, models generate near-neighbour, foldable sequences; when each ligand is paired with many proteins, generations are more diverse but less consistently foldable. Our study exposes how annotation diversity and sampling choices elicit this behaviour and how it changes with the data distribution. These insights highlight dataset redundancy and incompleteness as key bottlenecks for sequence-only binder design. We release the curated datasets, trained models, and evaluation tools to support future work on ligand-conditioned protein generation.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Cell Systems: 167 papers in training set; Top 0.4%; 18.2%
2. Bioinformatics: 1061 papers in training set; Top 3%; 8.2%
3. Nature Machine Intelligence: 61 papers in training set; Top 0.3%; 8.2%
4. Nature Methods: 336 papers in training set; Top 2%; 4.7%
5. PLOS Computational Biology: 1633 papers in training set; Top 8%; 4.2%
6. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 21%; 3.5%
7. Nature Communications: 4913 papers in training set; Top 41%; 3.5%

50% of probability mass above

8. Journal of Cheminformatics: 25 papers in training set; Top 0.2%; 3.5%
9. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 3.5%
10. Nature Biotechnology: 147 papers in training set; Top 3%; 3.5%
11. Bioinformatics Advances: 184 papers in training set; Top 1%; 3.5%
12. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 2%; 2.7%
13. Journal of Chemical Information and Modeling: 207 papers in training set; Top 2%; 2.4%
14. Scientific Reports: 3102 papers in training set; Top 54%; 1.8%
15. Protein Science: 221 papers in training set; Top 0.7%; 1.8%
16. Chemical Science: 71 papers in training set; Top 1%; 1.7%
17. Cell Genomics: 162 papers in training set; Top 4%; 1.5%
18. Science: 429 papers in training set; Top 16%; 1.3%
19. BMC Bioinformatics: 383 papers in training set; Top 6%; 1.2%
20. Nature: 575 papers in training set; Top 14%; 0.9%
21. ACS Synthetic Biology: 256 papers in training set; Top 2%; 0.9%
22. Proteins: Structure, Function, and Bioinformatics: 82 papers in training set; Top 0.7%; 0.9%
23. Protein Engineering, Design and Selection: 14 papers in training set; Top 0.1%; 0.8%
24. mAbs: 28 papers in training set; Top 0.3%; 0.8%
25. PLOS ONE: 4510 papers in training set; Top 69%; 0.7%
26. Nucleic Acids Research: 1128 papers in training set; Top 19%; 0.7%
27. Journal of Chemical Theory and Computation: 126 papers in training set; Top 0.9%; 0.7%
28. Genome Biology: 555 papers in training set; Top 9%; 0.6%
29. The Journal of Physical Chemistry B: 158 papers in training set; Top 2%; 0.6%
30. eLife: 5422 papers in training set; Top 62%; 0.6%
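
The 50% cutoff stated for the journal list can be checked by accumulating the listed probabilities in rank order. The snippet below is a minimal sketch of that arithmetic; the function name `journals_to_reach` and the variable `probs` are illustrative assumptions and are not part of the page or of the prediction tool behind it.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the ten top-ranked journals,
# copied in rank order from the list above.
probs = [18.2, 8.2, 8.2, 4.7, 4.2, 3.5, 3.5, 3.5, 3.5, 3.5]


def journals_to_reach(probs, threshold=50.0):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


print(journals_to_reach(probs))
```

With the values listed here, the running total first reaches 50% at rank 7, which agrees with the stated cutoff.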