
Benchmarking within-sample minority variant detection with short-read sequencing in M. tuberculosis

Mulaudzi, S.; Kulkarni, S.; Marin, M. G.; Farhat, M. R.

2026-02-16 · bioinformatics · bioRxiv
doi: 10.64898/2026.02.13.704885
Abstract

Background: Low-frequency (minority) variants, i.e. variants detectable within a sample at low allele frequencies, are relevant in several areas of research and health, ranging from cancer to pathogen heteroresistance. There is uncertainty around the optimal bioinformatic approach to accurately and reproducibly distinguish low-frequency variants from sequencing or mapping error. To address this, we benchmarked seven variant callers on precision, recall and false-positive characteristics for detecting low-frequency variants, using simulated short-read whole-genome sequencing data for 700 Mycobacterium tuberculosis strains. We developed a new low-frequency error model, based on read-mapping and quality metrics, for filtering the output of the best-performing tool.

Results: We simulated 378 unique variants across 5 genomic backgrounds spanning 4 lineages. Variants were simulated to represent 3 genomic region categories, 10 allele frequencies and 5 sequencing depths. FreeBayes, a haplotype-based variant caller, achieved the highest pooled F1 score of the seven tools in drug-resistance regions (average F1 = 0.86), and its higher performance held across genomic context and background. Across tools, we identified lower performance in repetitive (low-mappability) regions, and strong reference bias in low-frequency variant calling. We validated variant caller performance on a sample of in vitro strain mixtures, substantiating our ranking. When paired with FreeBayes, the error model excludes 49% of false variants and <1% of true variants.

Conclusions: Our analysis provides evidence to support best practices for low-frequency variant calling, including tool choice, masking and filtering. We also develop and provide a new error model that excludes false-positive low-frequency variant calls from FreeBayes output.
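The pooled F1 score reported in the abstract is the standard harmonic mean of precision and recall computed over pooled true-positive, false-positive and false-negative counts. A minimal sketch of that calculation (the counts below are illustrative only, not taken from the paper):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are true."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of true variants that were called."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical pooled counts across simulated samples (illustrative):
tp, fp, fn = 340, 60, 50
p, r = precision(tp, fp), recall(tp, fn)
score = f1(p, r)  # ~0.86 with these example counts
```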

Matching journals

The top 4 journals are the smallest set accounting for at least 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | BMC Bioinformatics | 383 | Top 0.2% | 22.2%
2 | Microbial Genomics | 204 | Top 0.1% | 17.3%
3 | Genome Medicine | 154 | Top 0.8% | 6.7%
4 | Bioinformatics | 1061 | Top 4% | 6.7%
--- 50% of probability mass above this line ---
5 | Scientific Reports | 3102 | Top 28% | 4.2%
6 | Journal of Clinical Microbiology | 120 | Top 0.6% | 3.6%
7 | PLOS Computational Biology | 1633 | Top 10% | 3.5%
8 | BMC Genomics | 328 | Top 0.9% | 3.5%
9 | PLOS ONE | 4510 | Top 41% | 3.5%
10 | NAR Genomics and Bioinformatics | 214 | Top 1% | 2.0%
11 | PeerJ | 261 | Top 6% | 1.9%
12 | GigaScience | 172 | Top 1% | 1.7%
13 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.5%
14 | Nature Communications | 4913 | Top 54% | 1.5%
15 | Bioinformatics Advances | 184 | Top 4% | 1.1%
16 | F1000Research | 79 | Top 3% | 0.9%
17 | The Lancet Microbe | 43 | Top 1.0% | 0.9%
18 | Clinical Infectious Diseases | 231 | Top 4% | 0.9%
19 | mSystems | 361 | Top 7% | 0.9%
20 | Briefings in Bioinformatics | 326 | Top 7% | 0.7%
21 | Wellcome Open Research | 57 | Top 2% | 0.7%
22 | Malaria Journal | 48 | Top 2% | 0.6%
23 | BMC Biology | 248 | Top 6% | 0.6%