Entropy Fusion DNA: Alignment-Free Gene Fusion Detection through Entropy and Mutual Information Descriptors

Benevento, G.; Malandrino, D.; Ture, A.; Zaccagnino, R.

2026-05-30 bioinformatics

10.64898/2026.05.27.728176 bioRxiv


Gene fusions are clinically relevant genomic alterations and key cancer biomarkers. Their computational detection remains dominated by alignment-based pipelines, whose reliance on read mapping, reference annotations, and heuristic filtering makes them sensitive to mapping ambiguities, annotation incompleteness, repetitive regions, and false positives. Recent machine learning (ML) strategies aim to learn fusion-related patterns directly from sequencing data, but their adoption is still limited by dataset-specific biases, synthetic data artifacts, class imbalance, and representations that may overlook the structural organization of biological sequences. Theoretical and statistical sequence descriptors remain underexplored as efficient tools for capturing informative structural signals in biological reads. In this work, we investigate whether fusion-related information can be inferred directly from the statistical organization of DNA sequences. Each sequence is encoded into a compact, interpretable, and alignment-free feature space combining Shannon and Rényi entropy, lagged and base-resolved mutual information, GC content, and rarefied k-mer richness descriptors. Our goal is to assess whether these information-theoretic features encode discriminative sequence signatures associated with fusion events. For discriminating fusion-derived from non-fusion sequences, nested cross-validation selected K-nearest neighbors as the most effective classifier, achieving strong held-out performance on the balanced benchmark (AUROC = 0.892, AUPRC = 0.865). The same representation was then evaluated on fusion-positive samples for fusion partner prediction and breakpoint localization, achieving strong top-k partner identification accuracy and stable breakpoint regression performance.
Moreover, a two-stage strategy in which the binary classifier first filters candidate reads further improved partner prediction, suggesting its use as an enrichment step for downstream fusion characterization. Although performance decreased under repeated fusion-pair-disjoint evaluation, it remained clearly above random expectation, supporting the transferability of the proposed descriptors to unseen fusion pairs. Breakpoint-centered validation further revealed increased local sequence complexity, altered short-range dependency structure, and modest but significant microhomology enrichment around fusion regions. Such findings support an interpretable alignment-free framework where information-theoretic features provide predictive and biologically informative signals for gene fusion analysis. The framework is available at: https://github.com/FLaTNNBio/EntropyFusionDNA

Graphical Abstract: Figure 1

Highlights
- Alignment-free information-theoretic DNA descriptors detect gene fusions.
- Resolved mutual-information features provide the strongest predictive signal.
- Two-stage screening enriches partner-gene prediction and breakpoint analysis.
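
To make the descriptor families named in the abstract concrete, the following is a minimal Python sketch of four of them: Shannon entropy and Rényi entropy over a k-mer distribution, lagged mutual information between bases, and GC content. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of k = 3, Rényi order alpha = 2, and lags 1 to 3 are all hypothetical, and the base-resolved mutual information and rarefied k-mer richness descriptors are omitted because the abstract does not define them.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any window with a non-ACGT symbol."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if set(w) <= set("ACGT"))


def shannon_entropy(counts):
    """Shannon entropy (bits) of the empirical distribution given by counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def renyi_entropy(counts, alpha=2.0):
    """Renyi entropy (bits) of order alpha; alpha = 1 falls back to Shannon."""
    if alpha == 1.0:
        return shannon_entropy(counts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    power_sum = sum((c / total) ** alpha for c in counts.values())
    return math.log2(power_sum) / (1.0 - alpha)


def lagged_mutual_information(seq, lag):
    """Mutual information (bits) between bases at positions i and i + lag."""
    seq = seq.upper()
    pairs = Counter((seq[i], seq[i + lag]) for i in range(len(seq) - lag))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    mi = 0.0
    for (a, b), c in pairs.items():
        # p(a,b) / (p(a) p(b)) written with raw counts to limit rounding error
        mi += (c / total) * math.log2(c * total / (left[a] * right[b]))
    return mi


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def describe(seq, k=3, alpha=2.0, lags=(1, 2, 3)):
    """Assemble a small feature dictionary for one read (illustrative only)."""
    counts = kmer_counts(seq, k)
    features = {
        "shannon_k": shannon_entropy(counts),
        "renyi_k": renyi_entropy(counts, alpha),
        "gc": gc_content(seq),
    }
    for lag in lags:
        features[f"mi_lag{lag}"] = lagged_mutual_information(seq, lag)
    return features
```

Calling `describe(read)` on each read would yield a fixed-length numeric vector that could then be passed to any standard classifier, such as the K-nearest neighbors model the abstract reports as the best performer.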

