
Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective

Feng, B. N.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.

2026-06-01 bioinformatics
10.64898/2026.05.28.728367 bioRxiv

The availability of multiple commercial short-read sequencing platforms calls for systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, we generated a large-scale targeted sequencing dataset on six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7) from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, together with Twist Bioscience cfDNA reference standards at 1% variant allele frequency (VAF). To build a realistic benchmark that preserves authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall but poor precision for VarScan2; higher precision but lower recall for Strelka2 and Mutect2), we developed SomaticXGB, a machine learning-based caller. SURFSeq 5000 consistently exhibited the lowest error rates and achieved the highest accuracy of the six platforms at VAFs as low as 0.5%. SomaticXGB attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on the Twist 1% standards, substantially outperforming conventional methods. This work provides a rich multi-platform data resource, a standardized pipeline for performance benchmarking, and a machine learning-based strategy for optimized somatic mutation detection.
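The precision/recall trade-off described in the abstract is what the F1 score summarizes. As a minimal sketch, assuming variants are compared as exact `(chrom, pos, ref, alt)` tuples against a truth set, the three metrics can be computed as follows. The function name, the tuple representation, and the toy variants are illustrative assumptions, not the authors' benchmarking pipeline.

```python
def precision_recall_f1(called, truth):
    """Return (precision, recall, F1) for a call set against a truth set.

    Both arguments are iterables of hashable variant keys; here we assume
    (chrom, pos, ref, alt) tuples. This is a hypothetical helper for
    illustration only.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and in truth
    fp = len(called - truth)   # false positives: called but not in truth
    fn = len(truth - called)   # false negatives: in truth but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (made up): 4 true variants; the caller finds 3 of them plus 2 false calls.
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr3", 900, "T", "G"),
}
called = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr2", 80, "G", "A"),   # false positive
    ("chr4", 10, "A", "C"),   # false positive
}

precision, recall, f1 = precision_recall_f1(called, truth)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a caller with high recall but poor precision (or the reverse) scores lower than one that balances both.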

Matching journals

The top 9 journals account for 50% of the predicted probability mass.
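The "top 9" figure can be checked by accumulating the per-journal probabilities in descending order until the running total reaches 50%. The sketch below assumes the rounded percentages listed on this page for the first ten journals; the function name is hypothetical, and the page's own calculation may use unrounded values.

```python
def journals_to_reach_mass(probabilities, threshold=50.0):
    """Return (count, cumulative_total) for the smallest number of
    top-ranked entries whose probabilities sum to at least `threshold`.

    Returns None if the threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(sorted(probabilities, reverse=True), start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


# Rounded percentages for the ten highest-ranked journals on this page.
top_probabilities = [9.0, 8.3, 6.7, 6.3, 6.3, 4.3, 4.1, 3.9, 3.6, 3.5]

result = journals_to_reach_mass(top_probabilities)
print(result)
```

With these rounded values, the first eight journals sum to roughly 48.9% and the ninth brings the total to roughly 52.5%, which is consistent with the statement above.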

Rank | Journal | Papers in training set | Percentile | Probability
1 | Briefings in Bioinformatics | 326 | Top 0.5% | 9.0%
2 | Genome Medicine | 154 | Top 0.6% | 8.3%
3 | Computational and Structural Biotechnology Journal | 216 | Top 0.4% | 6.7%
4 | Scientific Reports | 3102 | Top 19% | 6.3%
5 | Nature Communications | 4913 | Top 30% | 6.3%
6 | BMC Bioinformatics | 383 | Top 2% | 4.3%
7 | PLOS ONE | 4510 | Top 34% | 4.1%
8 | NAR Genomics and Bioinformatics | 214 | Top 0.6% | 3.9%
9 | Bioinformatics | 1061 | Top 5% | 3.6%
(50% of probability mass above)
10 | GigaScience | 172 | Top 0.5% | 3.5%
11 | Genomics, Proteomics & Bioinformatics | 171 | Top 2% | 3.5%
12 | Nucleic Acids Research | 1128 | Top 6% | 3.5%
13 | Genome Biology | 555 | Top 2% | 3.5%
14 | Communications Biology | 886 | Top 3% | 3.0%
15 | Advanced Science | 249 | Top 8% | 2.3%
16 | Nature Biotechnology | 147 | Top 4% | 2.3%
17 | Bioinformatics Advances | 184 | Top 3% | 1.7%
18 | International Journal of Molecular Sciences | 453 | Top 8% | 1.7%
19 | BMC Genomics | 328 | Top 5% | 0.9%
20 | Frontiers in Bioinformatics | 45 | Top 0.7% | 0.9%
21 | PLOS Computational Biology | 1633 | Top 22% | 0.9%
22 | Cell Genomics | 162 | Top 7% | 0.7%
23 | Nature Machine Intelligence | 61 | Top 4% | 0.7%
24 | Cell Reports Methods | 141 | Top 5% | 0.7%
25 | Nature Methods | 336 | Top 6% | 0.7%
26 | Analytical Chemistry | 205 | Top 3% | 0.6%