NanoSimFormer: An end-to-end Transformer-based simulator for nanopore sequencing signal data
Xie, S.; Ding, L.; Liu, L.; Zhu, Z.
Nanopore sequencing has achieved a new standard of accuracy with the advent of the R10.4.1 flow cell and high-performance Transformer-based basecalling models. However, existing signal simulators often fail to capture the complex, non-linear dynamics of nanopore current signals: they rely on static pore models or lack optimization objectives linked to basecalling, and so produce synthetic signals with substantially lower accuracy and fidelity than experimental data. To address this, we introduce NanoSimFormer, an end-to-end Transformer-based signal simulator that integrates basecaller guidance during training to generate high-fidelity nanopore signals explicitly optimized for accurate basecalling. Rigorous evaluation across diverse human, bacterial, and fungal R10.4.1 DNA sequencing datasets demonstrates that NanoSimFormer consistently outperformed competing methods (seq2squiggle and Squigulator), achieving median read accuracies exceeding 99% and Q-scores above 22.8, closely matching experimental baselines. NanoSimFormer faithfully recapitulated experimental variant calling performance on the human HG002 sample, achieving F1-scores of 0.9967 for SNPs and 0.8295 for small indels, and notably minimized false-positive errors in homopolymer and short tandem repeat (STR) regions, where other simulators struggled. Furthermore, NanoSimFormer-derived reads enabled high-quality de novo bacterial assembly with consensus error rates below one mismatch per 100 kbp, comparable to experimental assemblies, and preserved fungal mock community structures with high correlation to experimental abundance profiles in metagenomic benchmarks. With tunable parameters for amplitude noise and event duration variance, NanoSimFormer can simulate datasets spanning a wide range of data qualities. Together, these results establish NanoSimFormer as a robust tool for benchmarking and algorithm development in the current era of nanopore sequencing.
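The abstract reports read quality both as a percentage accuracy and as a Q-score. As a reading aid, the sketch below converts between the two, assuming the Q-scores follow the standard Phred convention, Q = -10 * log10(error rate). The abstract does not state how its Q-scores were computed, and the function names here are illustrative only; they are not part of NanoSimFormer.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled Q-score to a per-base accuracy in [0, 1).

    Assumes the standard Phred definition: Q = -10 * log10(error rate).
    """
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy in [0, 1) to a Phred-scaled Q-score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Figures quoted in the abstract, used purely as inputs here.
    print(f"Q22.8 -> accuracy {phred_to_accuracy(22.8):.4%}")
    print(f"99% accuracy -> Q{accuracy_to_phred(0.99):.1f}")
```

Under this convention an accuracy of 99% corresponds to Q20, and Q22.8 corresponds to an error rate of roughly 0.5%.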
Matching journals
The top 2 journals account for 50% of the predicted probability mass.