
GROQ-seq Datasets Across Transcription Factors (LacI, RamR, VanR), T7 RNA Polymerase and TEV Protease

Spinner, A.; Sreenivasan, S.; McLellan, J. R.; Ikonomova, S. P.; Cortade, D. L.; d'Oelsnitz, S.; Sheldon, K.; Vasilyeva, O. B.; Alperovich, N. Y.; Chadha, A.; Nematollahi, L.; Dhroso, A.; Sisson, Z.; Hudson, C. M.; DeBenedictis, E.; Kelly, P. J.; Reider Apel, A.; Ross, D.; Baranowski, C.

2026-04-18 bioengineering
10.64898/2026.04.15.718744 bioRxiv

Predicting any protein's function from its sequence alone would be a significant breakthrough in molecular biology. Although machine learning approaches have sought to tackle this, their limited generalizability reflects the absence of sufficiently large, open, diverse, and unified datasets. To address this data gap, we developed a high-throughput experimental platform called GROQ-seq (Growth-based Quantitative Sequencing). In GROQ-seq, a protein's function can be linked to a sequencing-based readout that enables scalable characterization of large variant libraries in Escherichia coli. Here, we present pilot datasets demonstrating its performance across three distinct protein function classes: transcription factors, polymerases, and proteases. The objective of this report is to present the datasets and to provide users with a clear and transparent characterization of their properties, including both the strengths and limitations.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | Nature Methods | 336 papers in training set | Top 0.6% | 14.1%
2 | Nucleic Acids Research | 1128 papers in training set | Top 1% | 12.1%
3 | Cell Systems | 167 papers in training set | Top 2% | 6.7%
4 | Nature Communications | 4913 papers in training set | Top 27% | 6.7%
5 | PLOS ONE | 4510 papers in training set | Top 29% | 6.2%
6 | PLOS Computational Biology | 1633 papers in training set | Top 6% | 6.2%
50% of probability mass above
7 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 18% | 3.9%
8 | Nature Biotechnology | 147 papers in training set | Top 3% | 3.5%
9 | BMC Genomics | 328 papers in training set | Top 1% | 2.3%
10 | Protein Engineering, Design and Selection | 14 papers in training set | Top 0.1% | 2.3%
11 | eLife | 5422 papers in training set | Top 36% | 2.0%
12 | Scientific Reports | 3102 papers in training set | Top 51% | 2.0%
13 | ACS Synthetic Biology | 256 papers in training set | Top 2% | 1.8%
14 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.7%
15 | Frontiers in Molecular Biosciences | 100 papers in training set | Top 2% | 1.7%
16 | GigaScience | 172 papers in training set | Top 2% | 1.5%
17 | BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
18 | Journal of Molecular Biology | 217 papers in training set | Top 2% | 1.2%
19 | Molecular Cell | 308 papers in training set | Top 9% | 0.9%
20 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 8% | 0.9%
21 | Protein Science | 221 papers in training set | Top 2% | 0.8%
22 | Genome Research | 409 papers in training set | Top 4% | 0.8%
23 | Molecular Systems Biology | 142 papers in training set | Top 1% | 0.8%
24 | Bioinformatics | 1061 papers in training set | Top 9% | 0.8%
25 | Cell Reports Methods | 141 papers in training set | Top 5% | 0.7%
26 | International Journal of Molecular Sciences | 453 papers in training set | Top 16% | 0.7%
27 | Science | 429 papers in training set | Top 20% | 0.7%
28 | Advanced Science | 249 papers in training set | Top 20% | 0.7%
29 | Cell Reports | 1338 papers in training set | Top 35% | 0.7%
30 | Cell Genomics | 162 papers in training set | Top 8% | 0.6%
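
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum over the ranked probabilities. The sketch below is illustrative only: the function name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages displayed in the list above, so the totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for the ten highest-ranked journals,
# copied as displayed (rounded) from the list above.
probabilities = [14.1, 12.1, 6.7, 6.7, 6.2, 6.2, 3.9, 3.5, 2.3, 2.3]


def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running sum
    to reach the threshold, or None if the entries supplied never reach it."""
    for count, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(probabilities, 50.0))
```

With these rounded values the running sum first passes 50% at the sixth journal (about 52%), which matches the position of the "50% of probability mass above" marker.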