Steering Sequence Generation in Protein Language Models through Iterative Lookback Monte Carlo Sampling

Calvanese, F.; Lombardi, G.; Weigt, M.; Fernandez-de-Cossio-Diaz, J.

2026-05-07 bioinformatics
10.64898/2026.05.01.722156 bioRxiv

Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 °C higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.
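
The abstract's description of interleaving autoregressive elongation with Metropolis-Hastings refinement can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the four-letter alphabet, the first-order Markov chain standing in for a pLM, the steering potential (a count of the letter "E"), the tilted target p(x)·exp(λ·s(x)), the reading of "lookback" as resampling only the last few positions, and all function and parameter names.

```python
import math
import random

ALPHABET = "ACDE"
# Hypothetical toy "language model": a first-order Markov chain.
# Row a gives P(next = b | current = a) for b in ALPHABET order.
TRANSITIONS = {
    "A": [0.4, 0.3, 0.2, 0.1],
    "C": [0.3, 0.4, 0.2, 0.1],
    "D": [0.2, 0.3, 0.4, 0.1],
    "E": [0.3, 0.3, 0.3, 0.1],
}
INITIAL = [0.25, 0.25, 0.25, 0.25]


def log_prob(seq):
    """Log-likelihood of a non-empty sequence under the toy model."""
    lp = math.log(INITIAL[ALPHABET.index(seq[0])])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(TRANSITIONS[prev][ALPHABET.index(cur)])
    return lp


def steering_score(seq):
    """Assumed steering potential: number of 'E' residues."""
    return seq.count("E")


def log_target(seq, lam):
    """Unnormalized log of the assumed target p(x) * exp(lam * s(x))."""
    return log_prob(seq) + lam * steering_score(seq)


def elongate(seq, rng):
    """Autoregressive step: append one token sampled from the model."""
    weights = TRANSITIONS[seq[-1]] if seq else INITIAL
    return seq + rng.choices(ALPHABET, weights=weights)[0]


def mh_refine(seq, lam, lookback, n_steps, rng):
    """Metropolis-Hastings single-site updates on the last `lookback` positions."""
    current = log_target(seq, lam)
    for _ in range(n_steps):
        start = max(0, len(seq) - lookback)
        pos = rng.randrange(start, len(seq))
        # Uniform single-site proposal, which is symmetric.
        proposal = seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]
        proposed = log_target(proposal, lam)
        # Accept with probability min(1, target(proposal) / target(seq)).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            seq, current = proposal, proposed
    return seq


def ilmc_sample(length, lam, lookback=5, n_steps=10, seed=0):
    """Alternate one elongation step with a short MH refinement pass."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        seq = elongate(seq, rng)
        seq = mh_refine(seq, lam, lookback, n_steps, rng)
    return seq


unsteered = ilmc_sample(30, lam=0.0)
steered = ilmc_sample(30, lam=2.0)
print(unsteered, steering_score(unsteered))
print(steered, steering_score(steered))
```

With `lam=0.0` the refinement targets the toy model itself, so the sampler should behave like plain autoregressive sampling; increasing `lam` tilts samples toward higher steering scores. A real pLM would replace `log_prob` and `elongate`, and the paper's actual target distribution and proposal scheme may differ from this sketch.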

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Cell Systems | 167 papers in training set | Top 0.1% | 26.7%
2. Nature Methods | 336 papers in training set | Top 0.9% | 10.4%
3. Nature Biotechnology | 147 papers in training set | Top 0.9% | 8.7%
4. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 13% | 5.0%
(50% of probability mass above)
5. Nature Communications | 4913 papers in training set | Top 37% | 4.1%
6. Bioinformatics | 1061 papers in training set | Top 5% | 4.1%
7. Science | 429 papers in training set | Top 8% | 3.7%
8. Nature Computational Science | 50 papers in training set | Top 0.2% | 3.4%
9. PLOS Computational Biology | 1633 papers in training set | Top 12% | 2.7%
10. Molecular Biology and Evolution | 488 papers in training set | Top 2% | 2.2%
11. Nature Machine Intelligence | 61 papers in training set | Top 1% | 2.1%
12. Genome Research | 409 papers in training set | Top 2% | 1.7%
13. Genome Biology | 555 papers in training set | Top 4% | 1.7%
14. The American Journal of Human Genetics | 206 papers in training set | Top 2% | 1.7%
15. Nature Genetics | 240 papers in training set | Top 5% | 1.4%
16. Nature | 575 papers in training set | Top 13% | 1.1%
17. Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.0%
18. Scientific Reports | 3102 papers in training set | Top 68% | 1.0%
19. PLOS ONE | 4510 papers in training set | Top 62% | 1.0%
20. Genetics | 225 papers in training set | Top 3% | 1.0%
21. Nucleic Acids Research | 1128 papers in training set | Top 15% | 0.9%
22. Bioinformatics Advances | 184 papers in training set | Top 4% | 0.8%
23. Journal of Chemical Information and Modeling | 207 papers in training set | Top 3% | 0.8%
24. eLife | 5422 papers in training set | Top 57% | 0.8%
25. BMC Bioinformatics | 383 papers in training set | Top 7% | 0.8%
26. ACS Synthetic Biology | 256 papers in training set | Top 4% | 0.5%
27. Biophysical Journal | 545 papers in training set | Top 6% | 0.5%
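
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages. The snippet below is a small sketch using only the first eight values from the list; the variable names are illustrative.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-8, copied from the list above.
probs = [26.7, 10.4, 8.7, 5.0, 4.1, 4.1, 3.7, 3.4]

# Running total of probability mass by rank.
cumulative = list(accumulate(probs))

# Smallest number of top-ranked journals whose total reaches 50%.
k = next(rank for rank, total in enumerate(cumulative, start=1) if total >= 50.0)
print(k, round(cumulative[k - 1], 1))
```

The running totals are 26.7, 37.1, 45.8 and 50.8, so the threshold is first crossed at rank 4, in agreement with the page.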