
DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing

Yang, Q.; Li, L.; Ma, Q.; Yin, R.

2026-05-18 genomics
10.64898/2026.05.14.725245 bioRxiv

Background: DNA lesions arise from endogenous metabolism and environmental exposure and are major drivers of mutagenesis, aging, and cancer development. However, mapping DNA damage at nucleotide resolution remains technically challenging. Nanopore sequencing enables direct detection of chemical perturbations through alterations in ionic current signals. Despite this potential, existing computational approaches remain limited in their ability to generalize across diverse lesion types and to integrate nucleotide sequence context with raw signal information for accurate detection and localization.

Results: We present DamageFormer, a multimodal deep learning framework for detecting and localizing DNA lesions in native nanopore sequencing data. Central to this framework is LesionBERT, a damage-aware genomic foundation model built upon DNABERT-2 and enhanced with lesion-focused reconstruction objectives to improve the representation of chemically modified bases. DamageFormer integrates LesionBERT with a neural signal model through an adaptive gating mechanism, enabling dynamic weighting of sequence context and nanopore signal evidence. The model was trained using a joint objective that combines prediction, localization, and contrastive alignment losses to promote cross-modal coherence and spatial precision. On an oxidative DNA damage benchmark comprising paired sequence and signal data, DamageFormer achieved an AUROC of 0.99997 for lesion detection and a mean absolute localization error of 0.00439, consistently outperforming state-of-the-art baselines. Model interpretation analyses revealed context-dependent modality weighting that adapts to variation in signal quality and sequence ambiguity. The framework further generalizes to chemically distinct guanine lesions not observed during training, demonstrating its robustness and transferability to unseen damage types.

Conclusions: Damage-aware biological language modeling combined with adaptive multimodal fusion enables accurate and interpretable identification of DNA lesions from nanopore sequencing data. This framework provides a scalable approach for characterizing genome-wide damage landscapes and illustrates how chemical DNA information can be systematically incorporated into genomic language models. The source code and pretrained models are available at: https://github.com/UF-HOBIYin-Lab/DamageFormer.
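The adaptive gating mentioned in the abstract can be illustrated with a generic gated-fusion sketch. This is not the authors' implementation: the function name `gated_fusion`, the toy dimensions, and the random weights are all assumptions made for illustration, and the actual architecture is defined in the linked repository. The sketch only shows the general idea of a sigmoid gate that decides, per feature, how much to trust a sequence embedding versus a signal embedding.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(seq_emb, sig_emb, gate_weights, gate_bias):
    """Fuse two same-length embeddings with an elementwise gate.

    Hypothetical helper: each row of gate_weights scores the concatenation
    [seq_emb, sig_emb] and yields one gate value g in (0, 1). The fused
    feature is g * sequence + (1 - g) * signal.
    """
    if len(seq_emb) != len(sig_emb):
        raise ValueError("embeddings must share a dimension")
    concat = list(seq_emb) + list(sig_emb)
    fused, gates = [], []
    for row, bias, s, x in zip(gate_weights, gate_bias, seq_emb, sig_emb):
        g = sigmoid(sum(w * c for w, c in zip(row, concat)) + bias)
        gates.append(g)
        fused.append(g * s + (1.0 - g) * x)
    return fused, gates


# Toy inputs (assumed values, not real model outputs).
rng = random.Random(0)
dim = 4
seq_emb = [rng.uniform(-1, 1) for _ in range(dim)]
sig_emb = [rng.uniform(-1, 1) for _ in range(dim)]
gate_weights = [[rng.gauss(0, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
gate_bias = [0.0] * dim

fused, gates = gated_fusion(seq_emb, sig_emb, gate_weights, gate_bias)
print("gates:", [round(g, 3) for g in gates])
print("fused:", [round(f, 3) for f in fused])
```

In a trained model the gate weights would be learned, so the gate values could shift toward the sequence branch when the signal is noisy and toward the signal branch when the sequence context is ambiguous, which is the kind of context-dependent weighting the abstract describes.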

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 1% | 18.7%
2. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.2% | 10.1%
3. Nature Communications | 4913 papers in training set | Top 26% | 6.8%
4. Briefings in Bioinformatics | 326 papers in training set | Top 0.7% | 6.8%
5. Genome Medicine | 154 papers in training set | Top 1.0% | 6.3%
6. GigaScience | 172 papers in training set | Top 0.3% | 4.3%

50% of probability mass above

7. NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.4% | 4.3%
8. Genome Biology | 555 papers in training set | Top 2% | 4.2%
9. Scientific Reports | 3102 papers in training set | Top 41% | 3.1%
10. PLOS Computational Biology | 1633 papers in training set | Top 11% | 3.1%
11. Nucleic Acids Research | 1128 papers in training set | Top 7% | 2.7%
12. Bioinformatics Advances | 184 papers in training set | Top 2% | 2.7%
13. BMC Bioinformatics | 383 papers in training set | Top 3% | 2.6%
14. Nature Machine Intelligence | 61 papers in training set | Top 2% | 1.9%
15. BMC Genomics | 328 papers in training set | Top 2% | 1.7%
16. Cell Reports Methods | 141 papers in training set | Top 3% | 1.5%
17. Nature Biotechnology | 147 papers in training set | Top 5% | 1.3%
18. Genome Research | 409 papers in training set | Top 3% | 1.3%
19. PLOS ONE | 4510 papers in training set | Top 61% | 1.1%
20. Frontiers in Genetics | 197 papers in training set | Top 7% | 1.0%
21. Advanced Science | 249 papers in training set | Top 18% | 0.8%
22. Patterns | 70 papers in training set | Top 2% | 0.7%
23. iScience | 1063 papers in training set | Top 34% | 0.7%
24. Communications Biology | 886 papers in training set | Top 29% | 0.6%
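
The 50% cutoff stated for the list above can be checked by accumulating the listed probabilities in rank order. The following is a minimal sketch; the helper name `journals_to_reach` is hypothetical, and the percentages are copied from the 24 listed journals.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the 24 listed journals, in rank order.
probs = [
    18.7, 10.1, 6.8, 6.8, 6.3, 4.3, 4.3, 4.2, 3.1, 3.1, 2.7, 2.7,
    2.6, 1.9, 1.7, 1.5, 1.3, 1.3, 1.1, 1.0, 0.8, 0.7, 0.7, 0.6,
]


def journals_to_reach(probs, threshold):
    """Return the smallest rank whose cumulative probability meets the
    threshold, or None if the listed journals never reach it."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


print(journals_to_reach(probs, 50.0))  # → 6
```

The cumulative total first passes 50% at rank 6 (about 53.0%), which matches the stated cutoff. The 24 listed journals together cover roughly 88% of the probability mass, so the remainder belongs to journals not shown.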