
High Diversity Gene Libraries Facilitate Machine Learning Guided Exploration of Fluorescent Protein Sequence Space

Benabbas, A.; Kearns, P.; Billo, A.; Chisholm, L. O.; Plesa, C.

2026-03-02 synthetic biology
10.64898/2026.03.01.706892 bioRxiv

While protein language models (PLMs) have shown great promise for protein design, their performance is fundamentally constrained by the diversity and completeness of available training data. In particular, PLMs often struggle to extrapolate to sequences that fall outside the distribution spanned by their training sets, limiting their ability to discover proteins in sparsely sampled regions of sequence space. Here we test the hypothesis that experimentally expanding training diversity can convert extrapolation into interpolation and thereby enable discovery of functional sequences beyond natural protein manifolds. Using large-scale gene synthesis and DNA shuffling, we generate libraries that span a broad region of fluorescent protein sequence space and create chimeric variants that bridge between distant homologs. Functional screening for blue fluorescence yields thousands of active variants distributed across diverse sequence lineages. Fine-tuning ProtGPT2 on this expanded dataset enables generation of diverse fluorescent proteins, including designs that extend beyond the regions occupied by known natural sequences while retaining function. This work illustrates how synthetic approaches can help address key limitations in machine learning-guided protein design, especially for small or sparsely populated protein families, by actively creating novel sequences across unexplored but functional regions of sequence space.

Matching journals

The top 3 journals together account for more than 50% of the predicted probability mass (56.7% combined).

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Science | 429 | Top 0.1% | 36.6% |
| 2 | Cell Systems | 167 | Top 0.9% | 12.0% |
| 3 | Nature | 575 | Top 4% | 8.1% |
| 4 | Nature Communications | 4913 | Top 34% | 4.7% |
| 5 | Nature Biotechnology | 147 | Top 2% | 4.2% |
| 6 | Cell | 370 | Top 7% | 3.5% |
| 7 | Proceedings of the National Academy of Sciences | 2130 | Top 23% | 3.1% |
| 8 | Journal of the American Chemical Society | 199 | Top 2% | 2.6% |
| 9 | Nature Methods | 336 | Top 3% | 2.6% |
| 10 | ACS Synthetic Biology | 256 | Top 1% | 2.3% |
| 11 | Neuron | 282 | Top 5% | 2.0% |
| 12 | Chemical Science | 71 | Top 0.8% | 1.8% |
| 13 | Nature Chemical Biology | 104 | Top 2% | 1.6% |
| 14 | eLife | 5422 | Top 52% | 0.9% |
| 15 | Cell Chemical Biology | 81 | Top 3% | 0.9% |
| 16 | Nature Machine Intelligence | 61 | Top 3% | 0.9% |
| 17 | Cell Genomics | 162 | Top 6% | 0.9% |
| 18 | Nature Structural & Molecular Biology | 218 | Top 5% | 0.7% |
| 19 | Science Advances | 1098 | Top 31% | 0.7% |
| 20 | ACS Central Science | 66 | Top 2% | 0.7% |
| 21 | Nucleic Acids Research | 1128 | Top 20% | 0.6% |
| 22 | Angewandte Chemie International Edition | 81 | Top 4% | 0.6% |
| 23 | Advanced Science | 249 | Top 22% | 0.6% |

Ranks 1 to 3 sit above the 50% probability-mass line.
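
The 50% cutoff in the ranking can be checked with a running sum over the listed probabilities. The following is a minimal sketch in Python, not the site's own method: the values are copied from the ranked entries above, and the helper name `entries_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (%) copied from the ranked entries above, in rank order.
probs = [36.6, 12.0, 8.1, 4.7, 4.2, 3.5, 3.1, 2.6, 2.6, 2.3, 2.0, 1.8,
         1.6, 0.9, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6]

def entries_to_reach(values, threshold):
    """Return how many top-ranked entries are needed for the
    cumulative sum to reach the threshold, or None if it never does."""
    for count, total in enumerate(accumulate(values), start=1):
        if total >= threshold:
            return count
    return None

top_k = entries_to_reach(probs, 50.0)
combined = sum(probs[:top_k])
print(f"{top_k} journals reach the cutoff with {combined:.1f}% combined")
```

The first two entries sum to 48.6%, just short of the cutoff, so the third entry is the one that crosses it.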