Sequence-Based Prioritization of Promoter Regulatory Variants in Colorectal Cancer Using a DNA Foundation Model
Shome, S.; Vajinepalli, S.; Saraf, A.
Show abstract
Noncoding regulatory variants contribute to colorectal cancer (CRC) susceptibility, yet their functional interpretation remains difficult.This is mainly attributed to regulatory effects being context-dependent and most noncoding regions lack reliable genomic annotations. We have developed a computational framework that aids in prioritizing promoter-associated variants using Evo2, a large-scale autoregressive DNA foundation model. In the framework, variants were mapped to promoter regions ({+/-}1,024 bp) across [~]1,250 CRC-associated genes and scored using Evo2-derived delta scores, the difference in sequence probability between reference and alternate alleles. Promoter variants showed greater predicted regulatory impact than non-promoter variants (median delta = 0.015 vs. 0.002; overall mean = 0.018, SD = 0.011). Applying a distributional threshold (delta > 0.020; top [~]25%) identified 287 high-impact variants across 198 CRC-associated genes. These genes were enriched in CRC-relevant pathways such as Wnt signaling, p53 signaling, and cell cycle regulation and 36.4% (72/198) overlapped known cancer genes (2.3-fold enrichment, p = 8.7x10-6). Independent validation showed high-impact variants were enriched at CRC GWAS loci and overlapped transcription factor binding sites ([~]32%) and motif-disrupting positions ([~]21%), supporting their functional relevance. Together, these results show that sequence-based foundation models can scalably prioritize noncoding regulatory candidates in CRC without supervised training or predefined annotations.
Matching journals
The top 5 journals account for 50% of the predicted probability mass.