
AI-assisted improvement of Aspergillus oryzae β-galactosidase using an Ensemble of Protein Language Models

Trapote Fernandez, A.; Fernandez, A.; Mendez-Liter, J. A.; Prieto, A.; Barriuso, J.; Osorio, F. G.

2026-05-21 synthetic biology
10.64898/2026.05.20.726739 bioRxiv

β-Galactosidases (BGs) are essential enzymes widely used in the food industry, particularly in the production of lactose-free products. Among them, the BG from Aspergillus oryzae is of industrial relevance due to its activity at acidic pH and moderate thermal tolerance. However, enhancing its catalytic performance remains a key challenge. Traditional enzyme engineering methods are time-consuming and resource-intensive, limiting their scalability. Recent advances in Artificial Intelligence (AI), particularly those based on Natural Language Processing, offer a promising alternative by enabling efficient exploration of protein sequence space and prediction of beneficial mutations. In this study, we introduce an ensemble-based, zero-shot Protein Language Model (PLM) pipeline that reconciles predictions from six independent models (ESM2 and the five ESM1v variants) combined with a diversity-aware candidate selection strategy. Applied to the BG from A. oryzae, this approach identified beneficial mutations leading to novel enzyme variants with up to a four-fold increase in catalytic efficiency on oNPGal, a two-fold increase on lactose, and, independently, a T338I variant with markedly enhanced thermostability (≈80% residual activity after 24 h at 60 °C), all without requiring supervised fine-tuning on experimental fitness data. Our results demonstrate that consensus across an ensemble of PLMs can efficiently enrich beneficial substitutions in industrially relevant enzymes and substantially reduce the number of wet-lab candidates that need to be screened.

[Table of Contents graphic: Figure 1]
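The abstract does not say how the six models' predictions are reconciled or how diversity is enforced, so the following is only a minimal sketch of one way such a pipeline could look. It assumes each model assigns every single-point mutation a zero-shot score where a value above zero favours the mutant, counts how many models agree, and then greedily picks candidates that are not too close in sequence position. The function names, the model labels, the voting rule, the separation rule, and all scores are hypothetical placeholders, not the authors' method or data.

```python
from statistics import mean

# Hypothetical labels for the six models named in the abstract.
MODELS = ["esm2", "esm1v_1", "esm1v_2", "esm1v_3", "esm1v_4", "esm1v_5"]

# Made-up scores for made-up mutations, one value per model (illustration only).
TOY_SCORES = {
    "A10V": [0.5, 0.4, 0.6, 0.3, 0.2, 0.1],
    "A10L": [0.3, 0.2, 0.1, 0.4, 0.2, 0.0],
    "G50S": [0.2, 0.1, 0.3, -0.1, 0.2, 0.1],
    "K90R": [-0.2, 0.1, -0.3, -0.1, 0.0, -0.4],
}


def parse_position(mutation):
    """Extract the residue position from a label such as 'A10V'."""
    return int(mutation[1:-1])


def consensus_rank(scores_by_model, min_votes):
    """Keep mutations that at least min_votes models score above zero.

    Results are sorted by vote count, then by mean score (both descending).
    """
    shared = set.intersection(*(set(s) for s in scores_by_model.values()))
    ranked = []
    for mut in shared:
        per_model = [scores[mut] for scores in scores_by_model.values()]
        votes = sum(1 for s in per_model if s > 0)
        if votes >= min_votes:
            ranked.append((mut, votes, mean(per_model)))
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


def select_diverse(ranked, n_candidates, min_separation):
    """Greedily pick top-ranked mutations whose positions are at least
    min_separation residues away from every mutation already chosen."""
    chosen = []
    for mut, _votes, _avg in ranked:
        pos = parse_position(mut)
        if all(abs(pos - parse_position(c)) >= min_separation for c in chosen):
            chosen.append(mut)
        if len(chosen) == n_candidates:
            break
    return chosen


# Reshape the toy table into {model: {mutation: score}}.
scores_by_model = {
    model: {mut: vals[i] for mut, vals in TOY_SCORES.items()}
    for i, model in enumerate(MODELS)
}

ranked = consensus_rank(scores_by_model, min_votes=5)
picked = select_diverse(ranked, n_candidates=2, min_separation=1)
print(picked)
```

With the toy values, the two substitutions at position 10 both pass the vote threshold, but only the higher-ranked one is kept, so the second pick comes from a different position. A real pipeline would replace the toy table with scores computed by the actual models and might use a different consensus rule, such as rank averaging.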

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | ACS Synthetic Biology | 256 | Top 0.2% | 26.4%
2 | Computational and Structural Biotechnology Journal | 216 | Top 0.2% | 9.3%
3 | Metabolic Engineering Communications | 20 | Top 0.1% | 4.9%
4 | Microbial Cell Factories | 22 | Top 0.1% | 4.4%
5 | Frontiers in Bioengineering and Biotechnology | 88 | Top 0.4% | 4.0%
6 | ACS Omega | 90 | Top 0.5% | 3.7%
(50% of probability mass above)
7 | Synthetic Biology | 21 | Top 0.1% | 3.7%
8 | Metabolic Engineering | 68 | Top 0.2% | 3.7%
9 | Synthetic and Systems Biotechnology | 10 | Top 0.1% | 3.1%
10 | Protein Engineering, Design and Selection | 14 | Top 0.1% | 2.5%
11 | International Journal of Molecular Sciences | 453 | Top 4% | 2.4%
12 | Biotechnology and Bioengineering | 49 | Top 0.3% | 1.9%
13 | Nature Communications | 4913 | Top 51% | 1.7%
14 | ACS Catalysis | 16 | Top 0.1% | 1.7%
15 | Journal of Chemical Information and Modeling | 207 | Top 2% | 1.7%
16 | Briefings in Bioinformatics | 326 | Top 4% | 1.5%
17 | Scientific Reports | 3102 | Top 68% | 1.0%
18 | PLOS ONE | 4510 | Top 63% | 0.9%
19 | Chemical Science | 71 | Top 2% | 0.9%
20 | Frontiers in Plant Science | 240 | Top 4% | 0.9%
21 | Frontiers in Molecular Biosciences | 100 | Top 4% | 0.9%
22 | Biology Methods and Protocols | 53 | Top 2% | 0.8%
23 | Microbial Biotechnology | 29 | Top 0.8% | 0.8%
24 | iScience | 1063 | Top 31% | 0.8%
25 | Protein Science | 221 | Top 2% | 0.8%
26 | PLOS Computational Biology | 1633 | Top 24% | 0.8%
27 | International Journal of Biological Macromolecules | 65 | Top 3% | 0.8%
28 | Bioinformatics | 1061 | Top 9% | 0.8%
29 | eLife | 5422 | Top 59% | 0.7%
30 | Journal of Molecular Biology | 217 | Top 4% | 0.7%
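
The 50% cutoff stated above the list can be checked with a running sum over the listed probabilities. The sketch below copies the values for the ten highest-ranked journals from the list. The helper name `journals_to_reach` is hypothetical, and because the displayed percentages are rounded, the running totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1 to 10, copied from the list above.
probabilities = [26.4, 9.3, 4.9, 4.4, 4.0, 3.7, 3.7, 3.7, 3.1, 2.5]


def journals_to_reach(probs, threshold):
    """Return the smallest rank at which the cumulative probability
    reaches the threshold, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


cutoff = journals_to_reach(probabilities, 50.0)
print(cutoff)  # prints 6
```

The first five entries sum to about 49.0%, and adding the sixth brings the total to about 52.7%, which agrees with the statement that the top 6 journals cover 50% of the mass.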