
Workstation benchmark of Spark Capable Genome Analysis ToolKit 4 Variant Calling

Hansen, M. H.; Simonsen, A. T.; Ommen, H. B.; Nyvold, C. G.

2020-05-19 bioinformatics
10.1101/2020.05.17.101105 bioRxiv

Background: Rapid and practical processing of DNA sequencing data has become essential for modern biomedical laboratories, especially in cancer, pathology and genetics. While sequencing turnaround time has been, and still is, a bottleneck in research and diagnostics, bioinformatics is moving at a rapid pace in both hardware and software development. Here, we benchmarked the local performance of three of the most important Spark-enabled Genome Analysis Toolkit 4 (GATK4) tools in a targeted sequencing workflow: duplicate marking, base quality score recalibration (BQSR) and variant calling, using a modest hyperthreading 12-core single CPU and a high-speed PCI Express solid-state drive.

Results: Compared to the previous GATK version, Spark-enabled BQSR and HaplotypeCaller make more efficient use of the available CPU cores and outperform GATK3.8 with an order of magnitude reduction in processing time to analysis-ready variants, whereas MarkDuplicatesSpark was found to be thrice as fast. Furthermore, HaplotypeCallerSpark and BQSRPipelineSpark were significantly faster than the equivalent GATK4 standard tools, with a combined ~86% reduction in execution time, reaching a median rate of ten million processed bases per second, and duplicate marking time was reduced by ~42%. The called variants were in close agreement between the Spark and non-Spark versions, with an overall concordance of 98%. In this setup, the tools were also highly efficient when compared with execution on a small 72 virtual CPU/18-node Google Cloud cluster.

Conclusion: GATK4 offers practical parallelization possibilities for DNA sequence processing, and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark-enabled GATK variant calling is several times faster than the previous GATK3.8 multithreading with the same multi-core, single-CPU configuration. The improved opportunities for parallel computation hold implications not only for high-performance clusters, but also for modest laboratory or research workstations used for targeted sequencing analysis, such as exome, panel or amplicon sequencing.
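The abstract mixes two ways of reporting the same gain: percent reduction in execution time (~86%, ~42%) and speed-up factors ("thrice as fast", "an order of magnitude"). The short Python sketch below shows how the two relate. The function names are illustrative and not part of the study, and the inputs are the rounded figures quoted in the abstract, so the outputs are approximate.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percent reduction in run time into a speed-up factor.

    A reduction of r% means new_time = old_time * (1 - r/100),
    so the speed-up is old_time / new_time = 1 / (1 - r/100).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def reduction_from_speedup(factor: float) -> float:
    """Convert a speed-up factor into a percent reduction in run time."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return (1.0 - 1.0 / factor) * 100.0


if __name__ == "__main__":
    # Rounded values taken from the abstract; treat the results as approximate.
    for label, pct in [
        ("BQSR + HaplotypeCaller, Spark vs. standard GATK4", 86.0),
        ("Duplicate marking, Spark vs. standard GATK4", 42.0),
    ]:
        print(f"{label}: {pct:.0f}% reduction = {speedup_from_reduction(pct):.2f}x")

    for label, factor in [
        ("MarkDuplicatesSpark vs. GATK3.8 (thrice as fast)", 3.0),
        ("BQSR + HaplotypeCaller vs. GATK3.8 (order of magnitude)", 10.0),
    ]:
        print(f"{label}: {factor:.0f}x = {reduction_from_speedup(factor):.1f}% reduction")
```

Under this conversion, an ~86% reduction corresponds to roughly a sevenfold speed-up, and an ~42% reduction to a bit under twofold.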

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics; 383 papers in training set; Top 0.1%; 44.5%
2. GigaScience; 172 papers in training set; Top 0.2%; 6.8%
50% of probability mass above
3. PLOS ONE; 4510 papers in training set; Top 33%; 4.5%
4. Bioinformatics; 1061 papers in training set; Top 5%; 4.2%
5. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 1%; 3.9%
6. Gigabyte; 60 papers in training set; Top 0.3%; 3.8%
7. PeerJ; 261 papers in training set; Top 2%; 3.8%
8. Scientific Reports; 3102 papers in training set; Top 43%; 2.8%
9. PLOS Computational Biology; 1633 papers in training set; Top 14%; 2.0%
10. F1000Research; 79 papers in training set; Top 1%; 1.8%
11. NAR Genomics and Bioinformatics; 214 papers in training set; Top 2%; 1.3%
12. Briefings in Bioinformatics; 326 papers in training set; Top 5%; 1.3%
13. BMC Genomics; 328 papers in training set; Top 4%; 1.0%
14. Methods; 29 papers in training set; Top 0.4%; 1.0%
15. Journal of Proteome Research; 215 papers in training set; Top 2%; 0.8%
16. BMC Medical Genomics; 36 papers in training set; Top 1%; 0.8%
17. SoftwareX; 15 papers in training set; Top 0.3%; 0.8%
18. Genes; 126 papers in training set; Top 3%; 0.8%
19. Biology; 43 papers in training set; Top 2%; 0.8%
20. Access Microbiology; 22 papers in training set; Top 0.8%; 0.7%
21. Genome Medicine; 154 papers in training set; Top 9%; 0.7%
22. Clinical and Translational Science; 21 papers in training set; Top 1%; 0.7%
23. G3 Genes|Genomes|Genetics; 351 papers in training set; Top 3%; 0.7%
24. Informatics in Medicine Unlocked; 21 papers in training set; Top 2%; 0.5%
25. BioData Mining; 15 papers in training set; Top 1%; 0.5%