Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays

Venukuttan, R.; Doty, R.; Thomson, A.; Chen, Y.; Li, B.; Duan, Y.; Barrera, A.; Dura, K.; Ko, K.-Y.; Lapp, H.; Reddy, T. E.; Allen, A. S.; Majoros, W. H.

bioRxiv (bioinformatics), 2026-03-31. DOI: 10.64898/2026.03.27.714770
Abstract

Assessing likely variant effects on phenotypes is of critical importance in diagnostic settings. While much progress has been made in interpreting genic mutations based on our understanding of coding sequence, noncoding variants can be much more challenging to interpret reliably from DNA sequence alone. High-throughput reporter assays such as STARR-seq and MPRA have shown utility in experimentally measuring the regulatory effects of noncoding variants present in samples, but they provide no readout for variants absent from the assay inputs. However, whole-genome reporter assays provide copious data that can be used to train predictive models for prioritizing variants not directly observed in the experiment. We describe a retrainable predictive modeling framework, BlueSTARR, for this task, and present results from training several models with this framework on whole-genome STARR-seq data from two cell lines and one drug treatment. Using these models, we uncover a global signature across the human genome consistent with purifying selection against both loss-of-function and gain-of-function regulatory variants, with the latter showing a significant bias consistent with selection against gains of cis-regulatory function in closed chromatin proximal to genes. By testing the model on synthetic enhancers with binding motifs for the transcription factors GR and AP-1, we find that, when trained on drug-perturbation data, the model is able to learn distance-dependent and treatment-dependent binding patterns and their resulting reporter-gene activation. These results demonstrate that lightweight, easily retrainable models such as ours have utility in probing latent signals present in novel experimental data. Finally, we find only modest differences in performance between deep-learning architectures when trained on this single data modality, and while somewhat greater predictive accuracy can be achieved with much larger models trained at great expense on many terabytes of data, there is still copious room for improvement even for industrial-strength, state-of-the-art models.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Cell Systems                                        167                     Top 0.2%    22.5%
2     Genome Biology                                      555                     Top 0.3%    10.1%
3     Nature Communications                               4913                    Top 25%     7.2%
4     Nucleic Acids Research                              1128                    Top 5%      4.2%
5     Bioinformatics                                      1061                    Top 5%      4.0%
6     The American Journal of Human Genetics              206                     Top 1%      4.0%
----- 50% of probability mass above this line -----
7     Scientific Reports                                  3102                    Top 37%     3.6%
8     Nature Methods                                      336                     Top 3%      2.9%
9     PLOS Computational Biology                          1633                    Top 11%     2.9%
10    Proceedings of the National Academy of Sciences     2130                    Top 24%     2.9%
11    Genome Medicine                                     154                     Top 3%      2.6%
12    Cell Genomics                                       162                     Top 2%      2.4%
13    NAR Genomics and Bioinformatics                     214                     Top 1%      2.1%
14    Nature Machine Intelligence                         61                      Top 1%      2.1%
15    Bioinformatics Advances                             184                     Top 3%      1.8%
16    Genome Research                                     409                     Top 2%      1.7%
17    Nature                                              575                     Top 11%     1.7%
18    Nature Genetics                                     240                     Top 4%      1.7%
19    PLOS ONE                                            4510                    Top 58%     1.3%
20    BMC Bioinformatics                                  383                     Top 5%      1.3%
21    Nature Biotechnology                                147                     Top 5%      1.3%
22    Frontiers in Genetics                               197                     Top 6%      1.3%
23    Cancer Research                                     116                     Top 3%      0.8%
24    Science Advances                                    1098                    Top 28%     0.8%
25    Briefings in Bioinformatics                         326                     Top 7%      0.7%
26    Computational and Structural Biotechnology Journal  216                     Top 10%     0.7%
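The "top 6 journals account for 50%" statement above follows directly from a cumulative sum over the listed probabilities. A minimal sketch of that computation, using the percentages from the table (the function name `journals_for_mass` is illustrative, not part of the matching tool):

```python
# Probabilities (%) for the 26 matched journals, in ranked order,
# taken from the table above.
probs = [22.5, 10.1, 7.2, 4.2, 4.0, 4.0, 3.6, 2.9, 2.9, 2.9,
         2.6, 2.4, 2.1, 2.1, 1.8, 1.7, 1.7, 1.7, 1.3, 1.3,
         1.3, 1.3, 0.8, 0.8, 0.7, 0.7]

def journals_for_mass(probs, threshold=50.0):
    """Smallest number of top-ranked entries whose cumulative
    probability mass reaches the threshold (in percent)."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank
    return len(probs)

print(journals_for_mass(probs))  # -> 6 (22.5 + 10.1 + 7.2 + 4.2 + 4.0 + 4.0 = 52.0 >= 50)
```

Note that the cumulative mass at rank 6 is 52.0%, slightly over the 50% threshold; rank 5 reaches only 48.0%, which is why the cutoff line sits below rank 6.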