DPGT: A spark based high-performance joint variant calling tool for large cohort sequencing

Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.

2026-03-05 bioinformatics
10.64898/2026.03.02.709184 bioRxiv

Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging.

Findings: To meet this challenge, we developed Distributed Population Genetics Tool (DPGT), an efficient computing framework and a robust tool for joint variant calling on large cohorts based on Apache Spark. DPGT simplifies joint calling tasks for large cohorts with a single command on a local computer or a computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated the performance of DPGT using 2,504 1000 Genomes Project (1KGP), 6 Genome in a Bottle (GIAB) and 9,158 internal whole genome sequencing (WGS) samples together with existing methods. As a result, DPGT produced results comparable in accuracy to existing methods, with less time and better scalability.

Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT, implemented in Java and C++.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 0.9% | 23.5%
2. BMC Bioinformatics | 383 papers in training set | Top 0.1% | 23.5%
3. PLOS Computational Biology | 1633 papers in training set | Top 7% | 4.5%
(50% of probability mass above)
4. Nucleic Acids Research | 1128 papers in training set | Top 4% | 4.5%
5. PLOS ONE | 4510 papers in training set | Top 34% | 4.3%
6. Briefings in Bioinformatics | 326 papers in training set | Top 1% | 4.1%
7. Bioinformatics Advances | 184 papers in training set | Top 1% | 3.7%
8. GigaScience | 172 papers in training set | Top 0.4% | 3.7%
9. Genome Medicine | 154 papers in training set | Top 4% | 2.0%
10. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 4% | 1.8%
11. Frontiers in Genetics | 197 papers in training set | Top 4% | 1.8%
12. BioData Mining | 15 papers in training set | Top 0.4% | 1.5%
13. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.4%
14. Scientific Reports | 3102 papers in training set | Top 63% | 1.4%
15. BMC Medical Genomics | 36 papers in training set | Top 0.6% | 1.4%
16. Nature Communications | 4913 papers in training set | Top 57% | 1.2%
17. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.0%
18. Journal of Genetics and Genomics | 36 papers in training set | Top 2% | 1.0%
19. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 0.9%
20. Database | 51 papers in training set | Top 0.9% | 0.8%
21. PeerJ | 261 papers in training set | Top 14% | 0.8%
22. Human Genomics | 21 papers in training set | Top 0.4% | 0.7%
23. Journal of Translational Medicine | 46 papers in training set | Top 3% | 0.7%
24. European Journal of Human Genetics | 49 papers in training set | Top 2% | 0.5%
25. Alzheimer's & Dementia | 143 papers in training set | Top 3% | 0.5%
26. Genome Biology | 555 papers in training set | Top 9% | 0.5%
27. F1000Research | 79 papers in training set | Top 6% | 0.5%
28. BMC Genomics | 328 papers in training set | Top 7% | 0.5%
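
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The sketch below is only an illustration of that arithmetic: the function name `rank_reaching_mass` is hypothetical, the input is the first eight probabilities copied from the list, and nothing here reflects how the page itself computes the cutoff.

```python
# Predicted probabilities (percent) for ranks 1-8, copied from the list above.
probabilities = [23.5, 23.5, 4.5, 4.5, 4.3, 4.1, 3.7, 3.7]


def rank_reaching_mass(probs, threshold=50.0):
    """Return (rank, running total) at the first rank whose cumulative
    probability meets the threshold, or (None, total) if it is never met.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total


rank, mass = rank_reaching_mass(probabilities)
print(rank, mass)  # prints: 3 51.5
```

With these inputs the running total is 47.0% after two journals and 51.5% after three, so rank 3 is the first point at which the 50% threshold is met.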