
A recurrent sequencing artifact on Illumina sequencers with two-color fluorescent dye chemistry and its impact on somatic variant detection

Fu, B. J.; Viswanadham, V. V.; Maziec, D.; Jin, H.; Park, P. J.

2025-09-29 genomics
10.1101/2025.09.27.678978 bioRxiv

Background: The sequencing-by-synthesis technology by Illumina, Inc. enables efficient and scalable readouts of mutations from genomic data. To enhance sequencing speed and efficiency, Illumina has shifted from the four-color base calling chemistry of the HiSeq series to a two-color fluorescent dye chemistry in the NovaSeq series. Benchmarking sequencing artifacts due to biases in the newer chemistry is important to evaluate the quality of identified mutations.

Results: We re-analyzed a series of whole-genome sequencing experiments in which the same samples were sequenced on the NovaSeq 6000 (two-color) and HiSeq X10 (four-color) platforms by independent groups. In several samples, we observed a higher frequency of T-to-G and A-to-C substitutions ("T>G") at the read level for NovaSeq 6000 versus HiSeq X10. As the per-base error rate is still low, the artifactual substitutions have a negligible effect in identifying germline or high variant allele frequency (VAF) somatic mutations. However, such errors can confound the detection of low-VAF somatic variants in high-depth sequencing samples, particularly in studies of mosaic mutations in normal tissues, where variants have low read support and are called without a matched normal. The artifactual T>G variant calls disproportionately occur at NT[TG] trinucleotides, and we leveraged this observation to bioinformatically reduce the T>G excess in somatic mutation callsets.

Conclusions: We identified a recurrent artifact specific to the Illumina two-color chemistry platform on the NovaSeq 6000 with the potential to contaminate low-VAF somatic mutation calls. Thus, an unexpected enrichment of T>G mutations in mosaicism studies warrants caution.
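The abstract does not say how the NT[TG] observation was turned into a filter, so the following is only a minimal sketch of one way such a context check could look, not the authors' method. It assumes that NT[TG] means any 5' base, a mutated central T, and a 3' base of T or G, and that an A>C call on the reference strand is the same event read from the opposite strand. The function names `reverse_complement` and `is_suspect_t_to_g` are hypothetical.

```python
# Hypothetical helper names; the context rule is an assumed reading of "NT[TG]".
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_suspect_t_to_g(context: str, alt: str) -> bool:
    """Flag a substitution as a T>G call in an NT[TG] trinucleotide context.

    context: reference trinucleotide centered on the variant position.
    alt:     the alternate base reported at the central position.
    """
    context = context.upper()
    alt = alt.upper()
    if len(context) != 3:
        raise ValueError("context must be exactly three bases")

    ref = context[1]
    # An A>C call is the T>G event seen from the opposite strand, so
    # normalize it by reverse-complementing both the context and the alleles.
    if ref == "A" and alt == "C":
        context = reverse_complement(context)
        ref, alt = "T", "G"

    if ref != "T" or alt != "G":
        return False
    # Assumed meaning of NT[TG]: the 3' neighbor of the mutated T is T or G.
    return context[2] in ("T", "G")


if __name__ == "__main__":
    for ctx, alt in [("ATT", "G"), ("ATA", "G"), ("AAC", "C"), ("GAT", "C")]:
        print(ctx, alt, is_suspect_t_to_g(ctx, alt))
```

In a real callset, a flag like this would more plausibly be used to demand stronger read support for low-VAF calls than to discard them outright, since genuine T>G mutations also occur in these contexts.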

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. The Journal of Molecular Diagnostics: 36 papers in training set; Top 0.1%; 18.5%
2. Scientific Reports: 3102 papers in training set; Top 7%; 10.0%
3. Clinical Chemistry: 22 papers in training set; Top 0.1%; 6.3%
4. BMC Bioinformatics: 383 papers in training set; Top 2%; 6.3%
5. BMC Genomics: 328 papers in training set; Top 0.4%; 4.8%
6. PLOS ONE: 4510 papers in training set; Top 32%; 4.8%

50% of probability mass above

7. Human Mutation: 29 papers in training set; Top 0.2%; 3.6%
8. Bioinformatics: 1061 papers in training set; Top 5%; 3.6%
9. Genome Medicine: 154 papers in training set; Top 2%; 3.6%
10. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 2%; 3.6%
11. Journal of Clinical Virology: 62 papers in training set; Top 0.2%; 2.7%
12. BMC Medical Genomics: 36 papers in training set; Top 0.4%; 1.8%
13. NAR Genomics and Bioinformatics: 214 papers in training set; Top 2%; 1.7%
14. PeerJ: 261 papers in training set; Top 8%; 1.5%
15. PLOS Computational Biology: 1633 papers in training set; Top 19%; 1.3%
16. GigaScience: 172 papers in training set; Top 2%; 0.9%
17. Genome Biology: 555 papers in training set; Top 6%; 0.9%
18. Communications Biology: 886 papers in training set; Top 21%; 0.8%
19. Nature Communications: 4913 papers in training set; Top 61%; 0.8%
20. Genetics in Medicine: 69 papers in training set; Top 1%; 0.7%
21. Clinical Infectious Diseases: 231 papers in training set; Top 5%; 0.7%
22. Diagnostic Microbiology and Infectious Disease: 21 papers in training set; Top 0.3%; 0.7%
23. Briefings in Bioinformatics: 326 papers in training set; Top 7%; 0.6%
24. Journal of Genetics and Genomics: 36 papers in training set; Top 3%; 0.6%
25. Journal of Clinical Microbiology: 120 papers in training set; Top 2%; 0.6%
26. npj Genomic Medicine: 33 papers in training set; Top 1%; 0.6%