
vcfilt: A Zero-Allocation Streaming Filter for High-Throughput VCF Processing

KP, M. M.

2026-04-16 bioinformatics
10.64898/2026.04.14.718370 bioRxiv

Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size - routinely exceeding tens of gigabytes for population-scale studies - creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread - a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
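The byte-scan idea described in the abstract can be illustrated with a minimal Go sketch. This is not vcfilt's source code: the names Thresholds, Keep, field, infoValue and parseDecimal are hypothetical, and the handling of missing or multi-valued fields (such records are simply dropped) is a simplifying assumption that may differ from both vcfilt and bcftools. The sketch works on sub-slices of the input line and parses numbers directly from bytes, so the per-record path is intended to avoid heap allocation.

```go
package main

import "bytes"

// Thresholds holds minimum values for the three criteria named in the
// abstract. The type and field names are illustrative assumptions.
type Thresholds struct {
	MinQual, MinDP, MinAF float64
}

// field returns the n-th (0-based) tab-separated column of line as a
// sub-slice of the original buffer, so no bytes are copied.
func field(line []byte, n int) ([]byte, bool) {
	for ; n > 0; n-- {
		i := bytes.IndexByte(line, '\t')
		if i < 0 {
			return nil, false
		}
		line = line[i+1:]
	}
	if i := bytes.IndexByte(line, '\t'); i >= 0 {
		line = line[:i]
	}
	return line, true
}

// infoValue scans the semicolon-separated INFO column for "key=value"
// and returns the value bytes. The key must match a whole entry name,
// so "DP" does not match "XDP=5".
func infoValue(info []byte, key string) ([]byte, bool) {
	for len(info) > 0 {
		entry := info
		if i := bytes.IndexByte(info, ';'); i >= 0 {
			entry, info = info[:i], info[i+1:]
		} else {
			info = nil
		}
		if len(entry) > len(key) && entry[len(key)] == '=' &&
			string(entry[:len(key)]) == key {
			return entry[len(key)+1:], true
		}
	}
	return nil, false
}

// parseDecimal parses plain decimals such as "120" or "0.05" straight
// from bytes. Anything else (".", "", exponents, comma lists) is
// rejected. Simplification: no sign or exponent support.
func parseDecimal(b []byte) (float64, bool) {
	var intPart, fracPart float64
	fracDiv := 1.0
	seenDot, seenDigit := false, false
	for _, c := range b {
		switch {
		case c >= '0' && c <= '9':
			seenDigit = true
			if seenDot {
				fracPart = fracPart*10 + float64(c-'0')
				fracDiv *= 10
			} else {
				intPart = intPart*10 + float64(c-'0')
			}
		case c == '.' && !seenDot:
			seenDot = true
		default:
			return 0, false
		}
	}
	return intPart + fracPart/fracDiv, seenDigit
}

// Keep reports whether a VCF data line passes all three thresholds.
// Assumption: records with a missing or unparsable QUAL, DP or AF
// (including multi-allelic AF lists) are dropped.
func Keep(line []byte, t Thresholds) bool {
	qual, ok := field(line, 5) // column 6 is QUAL
	if !ok {
		return false
	}
	q, ok := parseDecimal(qual)
	if !ok || q < t.MinQual {
		return false
	}
	info, ok := field(line, 7) // column 8 is INFO
	if !ok {
		return false
	}
	dpRaw, ok := infoValue(info, "DP")
	if !ok {
		return false
	}
	dp, ok := parseDecimal(dpRaw)
	if !ok || dp < t.MinDP {
		return false
	}
	afRaw, ok := infoValue(info, "AF")
	if !ok {
		return false
	}
	af, ok := parseDecimal(afRaw)
	return ok && af >= t.MinAF
}
```

In a streaming filter, a function like Keep would be called once per data line read from a buffered reader, with header lines (those starting with '#') passed through unchanged. Restricting the parser to fixed columns and fixed keys is what removes the need for a general expression engine, at the cost of supporting only these three criteria.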

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics (1061 papers in training set): Top 1%, 18.8%
2. Genome Biology (555 papers in training set): Top 0.1%, 14.5%
3. Nature Methods (336 papers in training set): Top 1%, 8.5%
4. Nature Communications (4913 papers in training set): Top 22%, 8.5%
50% of probability mass above
5. Nucleic Acids Research (1128 papers in training set): Top 2%, 7.2%
6. Genome Medicine (154 papers in training set): Top 1%, 4.9%
7. Nature Biotechnology (147 papers in training set): Top 2%, 4.3%
8. BMC Bioinformatics (383 papers in training set): Top 2%, 4.3%
9. Genome Research (409 papers in training set): Top 0.9%, 3.6%
10. Bioinformatics Advances (184 papers in training set): Top 2%, 2.1%
11. Briefings in Bioinformatics (326 papers in training set): Top 3%, 1.9%
12. The American Journal of Human Genetics (206 papers in training set): Top 2%, 1.7%
13. PLOS ONE (4510 papers in training set): Top 53%, 1.7%
14. Nature Genetics (240 papers in training set): Top 5%, 1.5%
15. Cell Systems (167 papers in training set): Top 8%, 1.3%
16. NAR Genomics and Bioinformatics (214 papers in training set): Top 3%, 1.2%
17. PLOS Computational Biology (1633 papers in training set): Top 21%, 1.0%
18. Nature (575 papers in training set): Top 13%, 1.0%
19. Nature Computational Science (50 papers in training set): Top 1%, 1.0%
20. GigaScience (172 papers in training set): Top 3%, 0.8%
21. Cell Genomics (162 papers in training set): Top 6%, 0.8%
22. Genomics, Proteomics & Bioinformatics (171 papers in training set): Top 6%, 0.7%
23. Scientific Reports (3102 papers in training set): Top 78%, 0.6%
24. Nature Machine Intelligence (61 papers in training set): Top 4%, 0.5%
25. Science (429 papers in training set): Top 22%, 0.5%