
Machine Learning-Assisted Pathway Optimization in Large Combinatorial Design Spaces: a p-Coumaric Acid Case Study

Abeel, T.; van Lent, P.; Hoek, R. v. d.; Schmitz, J.; Paz, S. M.; Kooi, I.; Jonkers, M.; Zwartjens, P.

2025-06-18 bioengineering
10.1101/2025.06.13.659482 bioRxiv

Combinatorial pathway optimization is an important tool in industrial metabolic engineering for improving the titer, yield, or productivity of strains. Machine learning is increasingly applied to many aspects of the Design-Build-Test-Learn (DBTL) cycle, an engineering framework that navigates the large landscape of theoretically possible designs iteratively. While machine learning-assisted recommendation strategies have been used successfully to optimize strains, they have so far been limited to relatively small design spaces with few targeted elements. Such small design spaces may limit key strengths of these approaches, such as the strong predictive capabilities of supervised machine learning and the exploration-exploitation schemes widely used in reinforcement learning and Bayesian optimization. In this work, two DBTL cycles are performed on Saccharomyces cerevisiae for p-coumaric acid production. We first perform a large library transformation on eighteen genes with twenty promoters, which expands the combinatorial design space significantly (approximately 170 million configurations), followed by a smaller model-guided recommendation round. We use a machine learning-assisted recommendation strategy, based on the gradient bandit algorithm, parametrized to balance exploration and exploitation. We show that our recommendation strategy outperforms greedy strain recommendation strategies, such as feature importance-based methods. While balancing exploration and exploitation has been shown to be important in many applications, we provide the first direct experimental illustration of this effect by recommending strains for scenarios with increasing exploitativeness. A clear effect of the exploration-exploitation scenario on the p-coumaric acid production distribution of strains is observed: a balanced scenario shows higher variation in production than an exploratory or exploitative scenario. Interestingly, using an alternative top-producing parent strain with this balanced exploration-exploitation scheme gives the highest p-coumaric acid production, suggesting that model predictions outside the training data distribution can still support successful strain recommendation. Overall, these results suggest that machine learning-assisted strategies with balanced exploration-exploitation can efficiently explore large combinatorial design spaces. The best engineered strain shows a 137% increase in p-coumaric acid production over the parent strains and a yield of 0.07 g/g on glucose.
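The abstract names the gradient bandit algorithm as the basis of the recommendation strategy but gives no implementation details. As a rough illustration only, the textbook form of that algorithm (softmax action preferences updated against a running reward baseline) can be sketched as follows. This is not the authors' strain recommendation method: the function names, the three-arm toy problem, the reward values in `TRUE_MEANS`, and the step size `alpha` are all assumptions made for the example. In this sketch, a larger `alpha` shifts the policy toward the currently best-looking arm faster (more exploitative), while a smaller one keeps the sampling distribution flatter for longer (more exploratory).

```python
import math
import random


def softmax(prefs):
    """Convert preferences into a probability distribution (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit(reward_fn, n_arms, n_steps=2000, alpha=0.1, seed=0):
    """Textbook gradient bandit: returns the final softmax policy over arms."""
    rng = random.Random(seed)
    prefs = [0.0] * n_arms   # one preference per arm, initially uniform
    baseline = 0.0           # running mean of rewards seen so far
    for t in range(1, n_steps + 1):
        probs = softmax(prefs)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, rng)
        advantage = reward - baseline
        # Raise the chosen arm's preference if it beat the baseline,
        # lower it otherwise; other arms move in the opposite direction.
        for a in range(n_arms):
            indicator = 1.0 if a == arm else 0.0
            prefs[a] += alpha * advantage * (indicator - probs[a])
        # Update the baseline after the preference step.
        baseline += (reward - baseline) / t
    return softmax(prefs)


# Hypothetical toy problem: three "designs" with made-up mean rewards.
TRUE_MEANS = [0.2, 0.5, 1.0]


def toy_reward(arm, rng):
    """Noisy reward for the chosen arm (assumed Gaussian noise, sd 0.1)."""
    return rng.gauss(TRUE_MEANS[arm], 0.1)


policy = gradient_bandit(toy_reward, n_arms=len(TRUE_MEANS))
print([round(p, 3) for p in policy])
```

In a strain design setting, one could imagine the arms corresponding to design choices and the reward to a measured or model-predicted production value, but how the authors map promoters, genes, and predictions onto the algorithm is not stated in the abstract.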

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Metabolic Engineering: 68 papers in training set, Top 0.1%, 18.8%
2. ACS Synthetic Biology: 256 papers in training set, Top 0.3%, 18.7%
3. Biotechnology and Bioengineering: 49 papers in training set, Top 0.1%, 12.8%
(50% of probability mass above)
4. Metabolic Engineering Communications: 20 papers in training set, Top 0.1%, 10.2%
5. PLOS Computational Biology: 1633 papers in training set, Top 8%, 4.3%
6. Nature Communications: 4913 papers in training set, Top 35%, 4.3%
7. Scientific Reports: 3102 papers in training set, Top 31%, 4.0%
8. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 2%, 3.6%
9. PLOS ONE: 4510 papers in training set, Top 45%, 2.6%
10. Frontiers in Bioengineering and Biotechnology: 88 papers in training set, Top 1.0%, 2.1%
11. Cell Systems: 167 papers in training set, Top 8%, 1.5%
12. Microbial Biotechnology: 29 papers in training set, Top 0.5%, 1.3%
13. Journal of The Royal Society Interface: 189 papers in training set, Top 3%, 1.2%
14. npj Systems Biology and Applications: 99 papers in training set, Top 2%, 1.1%
15. Frontiers in Molecular Biosciences: 100 papers in training set, Top 4%, 0.9%
16. IFAC-PapersOnLine: 12 papers in training set, Top 0.1%, 0.8%
17. BMC Bioinformatics: 383 papers in training set, Top 7%, 0.8%
18. mSystems: 361 papers in training set, Top 7%, 0.8%
19. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 44%, 0.8%
20. Frontiers in Microbiology: 375 papers in training set, Top 10%, 0.6%