An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

Lala, J.; Agrawal, H.; Dong, F.; Wells, J.; Angioletti-Uberti, S.

2026-03-05 · bioinformatics
bioRxiv · doi:10.64898/2026.03.04.709378
Abstract

We present a general approach for finding amino acid sequences corresponding to the most compact enzyme likely to retain the structure of a given catalytic site. Our approach is based on Monte Carlo (MC) simulations that sample an energy landscape whose minima correspond, by construction, to sequences with the aforementioned properties. Building on previous work (Wu et al., 2025) and using the BAGEL package (Lala et al., 2025), we implement a route to achieve this goal using only information extracted from a protein language model (PLM), without structural information. After generating a set of candidate sequences with this PLM-guided BAGEL optimization, we filter candidates for downstream experimental validation using a two-stage protocol. First, deep-learning-based structure prediction models (ESMFold, Chai-1, Boltz-2) are used to identify a structural consensus among designs with highly conserved active-site geometries, yielding many candidates with active-site RMSD below a few angstroms relative to the wild type and pLDDT scores above 80. Second, molecular dynamics simulations are performed on a filtered subset of sequences (selected by active-site RMSD and SolubleMPNN log-likelihoods) to evaluate active-site stability under thermal fluctuations. For the most promising enzymes, these yield active-site RMSF values below 1.0 Å and an active-site RMSD drift between 0.5 and 1.5 Å, making these mini-variants comparable to the wild type, though outcomes vary across enzymes. Given the protocol's generality, we believe these results represent a step forward in AI-guided enzyme design. To facilitate rapid experimental validation by the broader community, we open-source all sequences generated by our computational pipeline, including designs for the four representative enzymes of this study: PETase, subtilisin Carlsberg (a serine protease), Taq DNA polymerase, and VioA.
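The core loop the abstract describes is a Metropolis sampler over sequence space, with an energy built from PLM scores rather than structure. Below is a minimal sketch of that idea in Python. It is not BAGEL's implementation: the `plm_log_likelihood` stub, the weights `w_len` and `w_site`, and the two-move set (substitution and deletion) are all illustrative assumptions standing in for the paper's actual energy terms.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def plm_log_likelihood(seq: str) -> float:
    """Stand-in for a real PLM score (e.g. a masked-LM pseudo-log-likelihood).
    This stub only keeps the sketch self-contained: with a constant score the
    sampler degenerates to pure compaction, so swap in an actual model."""
    return 0.0

def energy(seq: str, sites: dict) -> float:
    """Toy energy whose minima are short sequences the PLM deems plausible
    and that keep the catalytic residues intact."""
    w_len, w_site = 0.1, 10.0                    # illustrative weights
    e = -plm_log_likelihood(seq)                 # PLM plausibility term
    e += w_len * len(seq)                        # compactness bias: shorter is better
    e += w_site * sum(seq[i] != aa for i, aa in sites.items())
    return e

def metropolis_step(seq, sites, e_old, beta=1.0):
    """One MC move (point substitution or single-residue deletion),
    accepted with the standard Metropolis criterion."""
    pos = random.randrange(len(seq))
    if pos in sites:                             # never mutate the active site
        return seq, sites, e_old
    if random.random() < 0.5 and len(seq) > len(sites) + 1:
        new_seq = seq[:pos] + seq[pos + 1:]      # deletion move shrinks the enzyme
        new_sites = {(i - 1 if i > pos else i): aa for i, aa in sites.items()}
    else:
        new_seq = seq[:pos] + random.choice(AMINO_ACIDS) + seq[pos + 1:]
        new_sites = sites
    e_new = energy(new_seq, new_sites)
    if e_new <= e_old or random.random() < math.exp(beta * (e_old - e_new)):
        return new_seq, new_sites, e_new         # accept
    return seq, sites, e_old                     # reject

def sample(seq, sites, n_steps=20_000, beta=1.0):
    """Run the chain and keep the lowest-energy sequence seen."""
    e = energy(seq, sites)
    best_seq, best_e = seq, e
    for _ in range(n_steps):
        seq, sites, e = metropolis_step(seq, sites, e, beta)
        if e < best_e:
            best_seq, best_e = seq, e
    return best_seq, best_e

# Example (hypothetical 0-based indices, subtilisin-like Asp32/His64/Ser221 triad):
# best_seq, best_e = sample(wild_type_seq, {31: "D", 63: "H", 220: "S"})
```

The deletion move is what drives miniaturization: every accepted deletion trades PLM plausibility against the length penalty, so low-energy states are the shortest sequences the PLM still considers enzyme-like while the catalytic residues stay fixed.

The first filtering stage then keeps only designs on which several structure predictors agree. Assuming active-site coordinates have already been extracted from each predicted model, the consensus check reduces to a superposed RMSD plus a confidence threshold. The sketch below uses a standard Kabsch superposition; the 2.0 Å and pLDDT 80 cutoffs are illustrative stand-ins, not the paper's exact criteria (the abstract only states "a few angstroms" and "above 80").

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets (e.g. active-site atoms of a
    predicted model vs. the wild type) after optimal superposition via the
    Kabsch algorithm."""
    P = P - P.mean(axis=0)                       # centre both point clouds
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # optimal rotation
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))

def keep_design(rmsd_by_predictor: dict, plddt: float,
                rmsd_cut: float = 2.0, plddt_cut: float = 80.0) -> bool:
    """Consensus filter: retain a design only if every predictor (e.g.
    ESMFold, Chai-1, Boltz-2) places the active site within rmsd_cut of the
    wild type and the prediction is confident."""
    return plddt >= plddt_cut and all(
        r <= rmsd_cut for r in rmsd_by_predictor.values())
```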

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

| # | Journal | Papers in training set | Percentile | Probability |
|---|---------|------------------------|------------|-------------|
| 1 | Bioinformatics | 1061 | Top 1% | 19.4% |
| 2 | Journal of Chemical Information and Modeling | 207 | Top 0.5% | 12.5% |
| 3 | Bioinformatics Advances | 184 | Top 0.1% | 10.1% |
| 4 | Protein Science | 221 | Top 0.1% | 10.1% |
| 5 | PLOS Computational Biology | 1633 | Top 4% | 8.4% |
| 6 | Journal of Chemical Theory and Computation | 126 | Top 0.2% | 4.8% |
| 7 | Journal of Cheminformatics | 25 | Top 0.1% | 4.8% |
| 8 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.3% | 2.4% |
| 9 | BMC Bioinformatics | 383 | Top 4% | 2.4% |
| 10 | Biophysical Journal | 545 | Top 3% | 1.7% |
| 11 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.5% |
| 12 | Protein Engineering, Design and Selection | 14 | Top 0.1% | 1.3% |
| 13 | PeerJ | 261 | Top 9% | 1.3% |
| 14 | Journal of Computational Chemistry | 11 | Top 0.1% | 1.2% |
| 15 | Molecular Biology and Evolution | 488 | Top 3% | 1.2% |
| 16 | Journal of Molecular Biology | 217 | Top 2% | 1.2% |
| 17 | Genome Biology and Evolution | 280 | Top 1% | 1.1% |
| 18 | Frontiers in Molecular Biosciences | 100 | Top 3% | 0.9% |
| 19 | Briefings in Bioinformatics | 326 | Top 6% | 0.9% |
| 20 | The Journal of Physical Chemistry B | 158 | Top 2% | 0.7% |
| 21 | Nature Communications | 4913 | Top 63% | 0.7% |
| 22 | International Journal of Molecular Sciences | 453 | Top 18% | 0.6% |