Sequence-Based Prioritization of Promoter Regulatory Variants in Colorectal Cancer Using a DNA Foundation Model

Shome, S.; Vajinepalli, S.; Saraf, A.

2026-05-28 bioinformatics
10.64898/2026.05.25.727528 bioRxiv

Noncoding regulatory variants contribute to colorectal cancer (CRC) susceptibility, yet their functional interpretation remains difficult, mainly because regulatory effects are context-dependent and most noncoding regions lack reliable genomic annotations. We developed a computational framework that prioritizes promoter-associated variants using Evo2, a large-scale autoregressive DNA foundation model. Variants were mapped to promoter regions (±1,024 bp) across ~1,250 CRC-associated genes and scored with Evo2-derived delta scores, defined as the difference in sequence probability between reference and alternate alleles. Promoter variants showed greater predicted regulatory impact than non-promoter variants (median delta = 0.015 vs. 0.002; overall mean = 0.018, SD = 0.011). Applying a distributional threshold (delta > 0.020; top ~25%) identified 287 high-impact variants across 198 CRC-associated genes. These genes were enriched in CRC-relevant pathways such as Wnt signaling, p53 signaling, and cell cycle regulation, and 36.4% (72/198) overlapped known cancer genes (2.3-fold enrichment, p = 8.7 × 10⁻⁶). Independent validation showed that high-impact variants were enriched at CRC GWAS loci and overlapped transcription factor binding sites (~32%) and motif-disrupting positions (~21%), supporting their functional relevance. Together, these results show that sequence-based foundation models can scalably prioritize noncoding regulatory candidates in CRC without supervised training or predefined annotations.
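The scoring and thresholding steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that the promoter window is centered on a transcription start site, that the delta score is taken as an absolute difference, and that variants are single-nucleotide substitutions. All function names are hypothetical, and `toy_score` is a stand-in for a model-derived sequence score, not the Evo2 API.

```python
from typing import Callable

FLANK_BP = 1024          # promoter half-width stated in the abstract (±1,024 bp)
DELTA_THRESHOLD = 0.020  # distributional cutoff stated in the abstract


def in_promoter(pos: int, tss: int, flank: int = FLANK_BP) -> bool:
    """True if a variant lies within ±flank bp of a TSS (TSS centering is an assumption)."""
    return abs(pos - tss) <= flank


def apply_snv(window: str, offset: int, ref: str, alt: str) -> str:
    """Return the window with the single-base ref allele at offset replaced by alt."""
    if window[offset] != ref:
        raise ValueError("reference allele does not match the window")
    return window[:offset] + alt + window[offset + 1:]


def delta_score(window: str, offset: int, ref: str, alt: str,
                score_fn: Callable[[str], float]) -> float:
    """Absolute difference between reference and alternate sequence scores.

    Taking the absolute value is an assumption; the abstract only says
    'difference in sequence probability'.
    """
    return abs(score_fn(window) - score_fn(apply_snv(window, offset, ref, alt)))


def is_high_impact(delta: float, threshold: float = DELTA_THRESHOLD) -> bool:
    """Apply the delta > threshold rule."""
    return delta > threshold


def toy_score(seq: str) -> float:
    """Hypothetical stand-in scorer (GC fraction); a real run would use a model score."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    d = delta_score("ACGTACGTAA", 0, "A", "G", toy_score)
    print(in_promoter(1500, 1000), is_high_impact(d))
```

Passing the scorer as a function keeps the thresholding logic independent of whichever model produces the sequence scores.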

Matching journals

The top 5 journals account for 50% of the predicted probability mass.
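
The cutoff rank can be checked by accumulating the listed probabilities until the running total reaches 50%. The short sketch below does this for the seven highest-ranked journals in the list; the function name is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (%) of the seven highest-ranked journals, in rank order.
probs = [22.1, 10.2, 8.2, 6.7, 6.2, 3.5, 3.5]


def journals_to_reach(values, target=50.0):
    """Return (rank, cumulative total) at which the running sum first reaches target."""
    for rank, total in enumerate(accumulate(values), start=1):
        if total >= target:
            return rank, total
    return None


if __name__ == "__main__":
    print(journals_to_reach(probs))
```

With these values the running total first reaches 50% at rank 5, with a cumulative total of about 53.4%.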

Rank | Journal | Papers in training set | Percentile | Probability
1 | Bioinformatics | 1061 | Top 1% | 22.1%
2 | Genome Medicine | 154 | Top 0.4% | 10.2%
3 | BMC Bioinformatics | 383 | Top 1% | 8.2%
4 | PLOS Computational Biology | 1633 | Top 5% | 6.7%
5 | Briefings in Bioinformatics | 326 | Top 1.0% | 6.2%
50% of probability mass above
6 | Bioinformatics Advances | 184 | Top 1% | 3.5%
7 | Nature Communications | 4913 | Top 41% | 3.5%
8 | NAR Genomics and Bioinformatics | 214 | Top 0.9% | 3.0%
9 | BioData Mining | 15 | Top 0.1% | 3.0%
10 | Frontiers in Genetics | 197 | Top 3% | 2.7%
11 | Nucleic Acids Research | 1128 | Top 8% | 2.4%
12 | Scientific Reports | 3102 | Top 47% | 2.3%
13 | Computational and Structural Biotechnology Journal | 216 | Top 3% | 2.0%
14 | Cell Systems | 167 | Top 7% | 1.7%
15 | European Journal of Human Genetics | 49 | Top 0.7% | 1.6%
16 | Genome Biology | 555 | Top 5% | 1.5%
17 | PLOS ONE | 4510 | Top 58% | 1.3%
18 | Cell Genomics | 162 | Top 4% | 1.3%
19 | The American Journal of Human Genetics | 206 | Top 3% | 1.2%
20 | GigaScience | 172 | Top 3% | 0.8%
21 | Communications Biology | 886 | Top 25% | 0.7%
22 | BMC Genomics | 328 | Top 7% | 0.6%
23 | BMC Medical Genomics | 36 | Top 2% | 0.6%