CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants

Chen, Y.; Fu, L.; Lu, X.; Li, W.; Gao, Y.; Wang, Y.; Ruan, Z.; Si, T.

2026-03-25 synthetic biology
10.64898/2026.03.24.714074 bioRxiv

Combinatorial mutagenesis is essential for exploring protein sequence-function landscapes in engineering applications. However, while large-scale machine learning benchmarks exist for protein function prediction, they are primarily limited to single-mutant libraries, leaving a critical gap for combinatorial mutagenesis. Here we introduce CombinGym, a benchmarking platform featuring 14 curated combinatorial mutagenesis datasets spanning 9 proteins with diverse functional properties including binding affinity, fluorescence, and enzymatic activities. We evaluated nine machine learning algorithms from five methodological categories (alignment-based, protein language, structure-based, sequence-label, and substitution-based) across multiple prediction tasks, assessing both zero-shot and supervised learning performance using Spearman's ρ and Normalized Discounted Cumulative Gain metrics. Our analysis reveals the substantial impact of measurement noise and data processing strategies on model performance. By implementing hierarchical dataset splits (0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios), we demonstrate the value of lower-order mutation data for empowering machine learning models to predict higher-order mutant properties. We validated this capacity through both in silico simulation (improving fluorescence brightness of an oxygen-independent fluorescent protein) and experimental validation (engineering enzyme substrate specificity), achieving a substantial increase in specific activity. All datasets, benchmarks, and metrics are available through an interactive website (https://www.combingym.org), facilitating collaborative dataset expansion and model development through integration with automated biofoundry platforms.
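The two metrics and the hierarchical splits named in the abstract can be illustrated with a small, dependency-free sketch. This is not the CombinGym implementation: the function names, the toy variant records, the NDCG gain convention (true scores shifted to be non-negative, scored over the full list), and the reading of "k-vs-rest" as "train on variants with at most k mutations, test on the rest" are all assumptions made for illustration.

```python
import math

def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(y_true, y_pred):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def ndcg(y_true, y_pred):
    """NDCG over the full list. Assumed gain: true score minus the minimum true score."""
    low = min(y_true)
    gains = [y - low for y in y_true]

    def dcg(order):
        # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
        return sum(gains[i] / math.log2(pos + 2) for pos, i in enumerate(order))

    by_pred = sorted(range(len(y_pred)), key=lambda i: -y_pred[i])
    ideal = sorted(range(len(gains)), key=lambda i: -gains[i])
    best = dcg(ideal)
    return dcg(by_pred) / best if best > 0 else 0.0

def k_vs_rest_split(variants, k):
    """Assumed reading of 'k-vs-rest': train on <= k mutations, test on > k."""
    train = [v for v in variants if len(v["mutations"]) <= k]
    test = [v for v in variants if len(v["mutations"]) > k]
    return train, test

# Hypothetical toy library: mutation labels and fitness values are made up.
variants = [
    {"mutations": [], "fitness": 1.0},
    {"mutations": ["A1G"], "fitness": 1.2},
    {"mutations": ["L2V"], "fitness": 0.8},
    {"mutations": ["A1G", "L2V"], "fitness": 1.5},
    {"mutations": ["A1G", "L2V", "K3R"], "fitness": 0.4},
]
train, test = k_vs_rest_split(variants, k=1)
```

With `k=1`, the wild type and the two single mutants land in `train`, while the double and triple mutants land in `test`. A model fitted on `train` would then be scored on `test` with `spearman_rho` and `ndcg`. Under this reading, a 0-vs-rest split leaves only the wild type for training, which corresponds to the zero-shot setting.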

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | ACS Synthetic Biology | 256 | Top 0.3% | 18.4% |
| 2 | Cell Systems | 167 | Top 0.7% | 12.5% |
| 3 | Nature Communications | 4913 | Top 17% | 10.3% |
| 4 | Chemical Science | 71 | Top 0.2% | 4.8% |
| 5 | Nucleic Acids Research | 1128 | Top 5% | 3.9% |
| 6 | Protein Engineering, Design and Selection | 14 | Top 0.1% | 3.8% |
| | 50% of probability mass above | | | |
| 7 | Journal of Chemical Information and Modeling | 207 | Top 1% | 3.5% |
| 8 | Protein Science | 221 | Top 0.5% | 3.0% |
| 9 | ACS Central Science | 66 | Top 0.6% | 2.7% |
| 10 | Angewandte Chemie International Edition | 81 | Top 1% | 2.3% |
| 11 | Nature Methods | 336 | Top 3% | 2.3% |
| 12 | Journal of the American Chemical Society | 199 | Top 3% | 1.7% |
| 13 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.7% |
| 14 | Journal of Molecular Biology | 217 | Top 2% | 1.6% |
| 15 | Communications Chemistry | 39 | Top 0.4% | 1.5% |
| 16 | Proteins: Structure, Function, and Bioinformatics | 82 | Top 0.6% | 1.5% |
| 17 | Nature Chemical Biology | 104 | Top 2% | 1.5% |
| 18 | Proceedings of the National Academy of Sciences | 2130 | Top 35% | 1.5% |
| 19 | Science | 429 | Top 16% | 1.3% |
| 20 | Synthetic Biology | 21 | Top 0.1% | 1.2% |
| 21 | Advanced Science | 249 | Top 16% | 0.9% |
| 22 | Nature Biotechnology | 147 | Top 6% | 0.9% |
| 23 | eLife | 5422 | Top 54% | 0.9% |
| 24 | ACS Catalysis | 16 | Top 0.2% | 0.8% |
| 25 | Bioinformatics | 1061 | Top 10% | 0.7% |
| 26 | Nature Machine Intelligence | 61 | Top 4% | 0.7% |
| 27 | Journal of Chemical Theory and Computation | 126 | Top 0.9% | 0.7% |
| 28 | Briefings in Bioinformatics | 326 | Top 8% | 0.6% |
| 29 | Structure | 175 | Top 4% | 0.6% |
| 30 | The Journal of Physical Chemistry B | 158 | Top 2% | 0.6% |
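
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities. A minimal sketch, assuming the percentages are used exactly as listed (they are rounded to one decimal place on the page); the function name `journals_to_reach` is hypothetical:

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the ranking above.
probabilities = [18.4, 12.5, 10.3, 4.8, 3.9, 3.8, 3.5, 3.0, 2.7, 2.3]

def journals_to_reach(probs, threshold):
    """Return (rank, cumulative %) at the first rank whose running total meets the threshold."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None  # threshold never reached with the given probabilities

rank, total = journals_to_reach(probabilities, 50.0)
print(f"{rank} journals cover {total:.1f}% of the probability mass")
```

The running total stays just under 50% after five journals (about 49.9%) and passes it at the sixth (about 53.7%), which matches the cutoff shown in the ranking.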