Entropy Fusion DNA: Alignment-Free Gene Fusion Detection through Entropy and Mutual Information Descriptors

Benevento, G.; Malandrino, D.; Ture, A.; Zaccagnino, R.

2026-05-30 bioinformatics
10.64898/2026.05.27.728176 bioRxiv

Gene fusions are clinically relevant genomic alterations and key cancer biomarkers. Their computational detection remains dominated by alignment-based pipelines, whose reliance on read mapping, reference annotations, and heuristic filtering makes them sensitive to mapping ambiguities, annotation incompleteness, repetitive regions, and false positives. Recent machine learning (ML) strategies aim to learn fusion-related patterns directly from sequencing data, but their adoption is still limited by dataset-specific biases, synthetic data artifacts, class imbalance, and representations that may overlook the structural organization of biological sequences. Theoretical and statistical sequence descriptors remain underexplored as efficient tools for capturing informative structural signals in biological reads. In this work, we investigate whether fusion-related information can be inferred directly from the statistical organization of DNA sequences. Each sequence is encoded into a compact, interpretable, and alignment-free feature space combining Shannon and Rényi entropy, lagged and base-resolved mutual information, GC content, and rarefied k-mer richness descriptors. Our goal is to assess whether these information-theoretic features encode discriminative sequence signatures associated with fusion events. For discriminating fusion-derived from non-fusion sequences, nested cross-validation selected k-nearest neighbors as the most effective classifier, achieving strong held-out performance on the balanced benchmark (AUROC = 0.892, AUPRC = 0.865). The same representation was then evaluated on fusion-positive samples for fusion partner prediction and breakpoint localization, achieving strong top-k partner identification accuracy and stable breakpoint regression performance.
Moreover, a two-stage strategy in which the binary classifier first filters candidate reads further improved partner prediction, suggesting its use as an enrichment step for downstream fusion characterization. Although performance decreased under repeated fusion-pair-disjoint evaluation, it remained clearly above random expectation, supporting the transferability of the proposed descriptors to unseen fusion pairs. Breakpoint-centered validation further revealed increased local sequence complexity, altered short-range dependency structure, and modest but significant microhomology enrichment around fusion regions. Such findings support an interpretable alignment-free framework where information-theoretic features provide predictive and biologically informative signals for gene fusion analysis. The framework is available at: https://github.com/FLaTNNBio/EntropyFusionDNA

Graphical Abstract: Figure 1

Highlights
- Alignment-free information-theoretic DNA descriptors detect gene fusions.
- Resolved mutual-information features provide the strongest predictive signal.
- Two-stage screening enriches partner-gene prediction and breakpoint analysis.
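
To make the descriptors named in the abstract more concrete, here is a minimal Python sketch of four of them: Shannon entropy, Rényi entropy, GC content, and lagged mutual information. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of base-2 logarithms, the default Rényi order of 2, and the lag of 1 are all hypothetical, and the base-resolved mutual information and rarefied k-mer richness descriptors are not covered. Sequences are assumed to be non-empty uppercase strings over A, C, G, T.

```python
import math
from collections import Counter


def _kmer_probs(seq, k):
    """Empirical probabilities of the overlapping k-mers in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    return [count / total for count in Counter(kmers).values()]


def shannon_entropy(seq, k=1):
    """Shannon entropy in bits of the k-mer distribution."""
    return -sum(p * math.log2(p) for p in _kmer_probs(seq, k))


def renyi_entropy(seq, alpha=2.0, k=1):
    """Renyi entropy of order alpha in bits (alpha = 1 falls back to Shannon)."""
    if alpha == 1.0:
        return shannon_entropy(seq, k)
    power_sum = sum(p ** alpha for p in _kmer_probs(seq, k))
    return math.log2(power_sum) / (1.0 - alpha)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def lagged_mutual_information(seq, lag=1):
    """Mutual information in bits between bases separated by `lag` positions."""
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    total = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / total
        # p_ab / (p_a * p_b), written with raw counts
        ratio = count * total / (left[a] * right[b])
        mi += p_ab * math.log2(ratio)
    return mi


def describe(seq):
    """Assemble a small, hypothetical feature dictionary for one sequence."""
    return {
        "shannon": shannon_entropy(seq),
        "renyi2": renyi_entropy(seq, alpha=2.0),
        "gc": gc_content(seq),
        "mi_lag1": lagged_mutual_information(seq, lag=1),
    }
```

Calling `describe(read)` on each read would give a fixed-length numeric vector that any standard classifier can consume, which is the sense in which such a representation is alignment-free: no reference genome or read mapping is involved.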

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Bioinformatics | 1061 papers in training set | Top 1% | 22.8%
2 | Nucleic Acids Research | 1128 papers in training set | Top 0.8% | 14.9%
3 | Nature Communications | 4913 papers in training set | Top 16% | 10.5%
4 | Briefings in Bioinformatics | 326 papers in training set | Top 0.5% | 8.5%
50% of probability mass above
5 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.2% | 6.4%
6 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.5% | 6.4%
7 | Advanced Science | 249 papers in training set | Top 5% | 3.6%
8 | PLOS Computational Biology | 1633 papers in training set | Top 11% | 2.9%
9 | Genome Biology | 555 papers in training set | Top 3% | 2.1%
10 | Cell Reports Methods | 141 papers in training set | Top 2% | 1.9%
11 | BMC Bioinformatics | 383 papers in training set | Top 4% | 1.7%
12 | Nature Biotechnology | 147 papers in training set | Top 5% | 1.7%
13 | Bioinformatics Advances | 184 papers in training set | Top 3% | 1.5%
14 | Cell Systems | 167 papers in training set | Top 9% | 1.3%
15 | GigaScience | 172 papers in training set | Top 2% | 1.2%
16 | Patterns | 70 papers in training set | Top 2% | 0.8%
17 | Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.8%
18 | Frontiers in Genetics | 197 papers in training set | Top 9% | 0.8%
19 | Communications Biology | 886 papers in training set | Top 23% | 0.8%
20 | iScience | 1063 papers in training set | Top 34% | 0.7%
21 | Frontiers in Bioinformatics | 45 papers in training set | Top 1% | 0.7%
22 | Scientific Reports | 3102 papers in training set | Top 77% | 0.7%
23 | International Journal of Molecular Sciences | 453 papers in training set | Top 19% | 0.5%
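
The 50% cut-off in the ranked list above can be checked with a short cumulative sum. The sketch below copies the listed probabilities and finds the smallest number of top-ranked journals whose combined probability reaches a threshold; the function name `journals_to_reach` is hypothetical and the threshold rule is an assumption about how the page draws its cut-off line.

```python
# Predicted probabilities (in percent), in the ranked order listed above.
PROBABILITIES = [
    22.8, 14.9, 10.5, 8.5, 6.4, 6.4, 3.6, 2.9, 2.1, 1.9, 1.7, 1.7,
    1.5, 1.3, 1.2, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.5,
]


def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank reaching threshold.

    Returns None if the listed probabilities never reach the threshold.
    """
    cumulative = 0.0
    for rank, probability in enumerate(probabilities, start=1):
        cumulative += probability
        if cumulative >= threshold:
            return rank, cumulative
    return None


rank, mass = journals_to_reach(PROBABILITIES)
print(f"Top {rank} journals cover {mass:.1f}% of the predicted mass")
```

With the listed values, the first three journals sum to 48.2% and the fourth brings the total to 56.7%, so four journals are the minimum needed to pass the 50% mark.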