
Hierarchical refinements of cis-regulatory inputs improve scalable gene expression prediction

Zhang, Q.; Xing, M.; Liao, Q.; Li, Z.; Huang, D.-S.

2026-06-02 bioinformatics
10.64898/2026.05.31.729151 bioRxiv

Deciphering the relationships between cis-regulatory elements (CREs) and target gene expression has long been a challenging problem in molecular biology. In particular, predicting gene expression from hundreds of candidate cis-regulatory elements (cCREs) requires models that scale to long, noisy inputs while retaining interpretable regulatory structure. Existing Transformer-based approaches typically attend over all nucleotides and all surrounding cCREs, diluting causal signals when hundreds of elements compete for limited model capacity. Here we introduce a two-stage selective framework (TSSF) that performs hierarchical refinements: nucleotide-level masking within each cCRE, followed by cCRE-level selection around each gene, implemented with information-bottleneck priors and a fully Transformer-based architecture. Across 70 human cell types and tissues, TSSF and lightweight variants improve expression prediction and enhancer-gene prioritization relative to strong baselines, including on cross-cell-line and cell-type-specific benchmarks. Prediction-stratified analysis motivates a distance-decay prior that aligns attention with long-range regulatory geometry, and chromatin-contact augmentation improves recovery of distal links. Motif analyses of high-confidence predictions recover proximal and distal regulatory programs, supporting mechanistic interpretability. TSSF offers a general strategy for scalable, interpretable modeling of high-dimensional regulatory inputs in genomics.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Probability
1     Cell Systems                                     167                     Top 0.6%    14.3%
2     Genome Biology                                   555                     Top 0.3%    12.0%
3     Nature Methods                                   336                     Top 1%      8.9%
4     Nature Communications                            4913                    Top 28%     6.6%
5     Nature Biotechnology                             147                     Top 1%      6.6%
6     Bioinformatics                                   1061                    Top 4%      6.2%
      (50% of probability mass above)
7     Nature Genetics                                  240                     Top 2%      4.2%
8     Nucleic Acids Research                           1128                    Top 5%      4.2%
9     Genome Medicine                                  154                     Top 2%      3.5%
10    Nature Machine Intelligence                      61                      Top 1%      3.0%
11    Science                                          429                     Top 11%     2.7%
12    PLOS Computational Biology                       1633                    Top 12%     2.5%
13    Genome Research                                  409                     Top 2%      2.3%
14    NAR Genomics and Bioinformatics                  214                     Top 2%      1.7%
15    Nature                                           575                     Top 11%     1.7%
16    The American Journal of Human Genetics           206                     Top 2%      1.7%
17    Cell Genomics                                    162                     Top 4%      1.6%
18    Briefings in Bioinformatics                      326                     Top 4%      1.4%
19    Molecular Cell                                   308                     Top 8%      1.2%
20    BMC Bioinformatics                               383                     Top 6%      0.9%
21    Proceedings of the National Academy of Sciences  2130                    Top 42%     0.9%
22    Advanced Science                                 249                     Top 20%     0.7%
23    Nature Plants                                    84                      Top 2%      0.7%
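
The "top 6 journals account for 50%" statement can be checked against the listed probabilities. The sketch below is only an illustration: it assumes that the statement means the smallest number of top-ranked journals whose summed probability reaches 50%, and the function name `journals_to_reach` is hypothetical, not part of the site.

```python
from itertools import accumulate

# Predicted probabilities (%) in rank order, copied from the journal list above.
probabilities = [
    14.3, 12.0, 8.9, 6.6, 6.6, 6.2, 4.2, 4.2, 3.5, 3.0, 2.7, 2.5,
    2.3, 1.7, 1.7, 1.7, 1.6, 1.4, 1.2, 0.9, 0.9, 0.7, 0.7,
]


def journals_to_reach(probs, threshold=50.0):
    """Return the smallest number of top-ranked entries whose summed
    probability reaches `threshold`, or None if it is never reached.

    Hypothetical helper; assumes `probs` is already sorted by rank.
    """
    for count, running_total in enumerate(accumulate(probs), start=1):
        if running_total >= threshold:
            return count
    return None


top_k = journals_to_reach(probabilities)
print(top_k)  # 6: the first five sum to 48.4%, the first six to 54.6%
```

Under this reading, the cutoff falls after Bioinformatics, which matches the position of the "50% of probability mass above" marker.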