From Circles to Signals: Representation Learning on Ultra-Long Extrachromosomal Circular DNA

Li, J.; Liu, Z.; Zhang, Z.; Zhang, J.; Singh, R.

2026-03-17 bioinformatics
10.1101/2025.11.22.689941 bioRxiv
Extrachromosomal circular DNA (eccDNA) is a covalently closed circular DNA molecule that plays an important role in cancer biology. Genomic foundation models have recently emerged as a powerful direction for DNA sequence modeling, enabling the direct prediction of biologically relevant properties from DNA sequences. Although recent genomic foundation models have shown strong performance on general DNA sequence modeling, their application to eccDNA remains limited: existing approaches either rely on computationally expensive attention mechanisms or truncate ultra-long sequences into kilobase fragments, thereby disrupting long-range continuity and ignoring the molecule's circular topology. To overcome these problems, we introduce eccDNAMamba, a bidirectional state space model (SSM) built upon the Mamba-2 framework, which scales linearly with input sequence length and enables scalable modeling of ultra-long eccDNA sequences. eccDNAMamba further incorporates a circular augmentation strategy to preserve the intrinsic circular topology of eccDNA. Comprehensive evaluations against state-of-the-art genomic foundation models demonstrate that eccDNAMamba achieves superior performance on ultra-long sequences across multiple task settings, such as cancer versus healthy eccDNA discrimination and eccDNA copy-number level prediction. Moreover, the Integrated Gradients (IG) based model explanation indicates that eccDNAMamba focuses on biologically meaningful regulatory elements and can uncover key sequence patterns in cancer-derived eccDNAs. Overall, these results demonstrate that eccDNAMamba effectively models ultra-long eccDNA sequences by leveraging their unique circular topology and regulatory architecture, bridging a critical gap in sequence analysis. Our code and datasets are available at https://github.com/zzq1zh/eccDNAMamba.
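The abstract does not spell out how the circular augmentation works; the following is a minimal hypothetical sketch of the general idea, not the authors' exact implementation. Because eccDNA is covalently closed, every rotation of the sequence denotes the same molecule, and appending a prefix of the sequence to its own end lets a linear model see across the junction where the circle was cut. The function names `circular_augment` and `random_rotation` are illustrative assumptions.

```python
# Hypothetical circular-augmentation sketch (assumed, not taken from the paper).

def circular_augment(seq: str, k: int) -> str:
    """Append the first k bases to the end to mimic circular continuity."""
    k = min(k, len(seq))
    return seq + seq[:k]

def random_rotation(seq: str, offset: int) -> str:
    """Rotate the circle to an arbitrary start point; the molecule is unchanged."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]

print(circular_augment("ACGTTG", 2))  # ACGTTGAC
print(random_rotation("ACGTTG", 2))   # GTTGAC
```

Any rotation produced this way is an equally valid linearization of the same circle, which is why such augmentations can act as a topology-preserving form of data augmentation.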

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 22.1% (1061 papers in training set, Top 1%)
2. Briefings in Bioinformatics: 8.3% (326 papers in training set, Top 0.6%)
3. Nature Communications: 6.7% (4913 papers in training set, Top 27%)
4. Nucleic Acids Research: 6.3% (1128 papers in training set, Top 3%)
5. IEEE Transactions on Computational Biology and Bioinformatics: 6.2% (17 papers in training set, Top 0.1%)
6. Bioinformatics Advances: 4.8% (184 papers in training set, Top 0.8%)
(50% of probability mass above this line)
7. Nature Machine Intelligence: 4.8% (61 papers in training set, Top 0.6%)
8. Advanced Science: 4.2% (249 papers in training set, Top 4%)
9. Cell Systems: 2.6% (167 papers in training set, Top 5%)
10. PLOS Computational Biology: 2.3% (1633 papers in training set, Top 13%)
11. Frontiers in Genetics: 2.0% (197 papers in training set, Top 4%)
12. Genome Research: 1.9% (409 papers in training set, Top 2%)
13. Genome Medicine: 1.7% (154 papers in training set, Top 5%)
14. Proceedings of the National Academy of Sciences: 1.7% (2130 papers in training set, Top 33%)
15. NAR Genomics and Bioinformatics: 1.7% (214 papers in training set, Top 2%)
16. Genome Biology: 1.5% (555 papers in training set, Top 5%)
17. BMC Bioinformatics: 1.3% (383 papers in training set, Top 5%)
18. Nature Biotechnology: 1.2% (147 papers in training set, Top 6%)
19. Scientific Reports: 1.2% (3102 papers in training set, Top 67%)
20. Nature Methods: 0.9% (336 papers in training set, Top 6%)
21. Communications Biology: 0.9% (886 papers in training set, Top 19%)
22. iScience: 0.9% (1063 papers in training set, Top 27%)
23. Computational and Structural Biotechnology Journal: 0.8% (216 papers in training set, Top 9%)
24. Journal of Chemical Information and Modeling: 0.7% (207 papers in training set, Top 3%)
25. Genomics, Proteomics & Bioinformatics: 0.7% (171 papers in training set, Top 6%)
26. Nature Computational Science: 0.7% (50 papers in training set, Top 2%)
27. GigaScience: 0.7% (172 papers in training set, Top 3%)
28. Frontiers in Molecular Biosciences: 0.6% (100 papers in training set, Top 6%)
29. Patterns: 0.6% (70 papers in training set, Top 3%)