Reference-free clustering as an epidemiological tool for Mycobacterium tuberculosis lineage typing

Chilengue, A. F.; Whiley, D. J.; Cox, K.; Domingo-Sananes, M. R.; Meehan, C. J.

2026-02-06 bioinformatics
10.64898/2026.02.05.703994 bioRxiv

Whole-genome sequencing (WGS) of Mycobacterium tuberculosis (Mtb) is widely used in the epidemiological investigation of recent transmission events because it provides high-resolution strain typing. Accurate and rapid strain typing is essential for informing outbreak investigations and guiding tuberculosis control strategies. However, the gold-standard reference-guided SNP-calling pipeline currently used for strain typing relies on computationally intensive reference mapping, which makes it difficult to run in many high-burden, resource-limited settings, where simplified and scalable genomic tools are urgently needed. To address these limitations, we explored reference-free methods for medium-resolution epidemiology, namely Mtb strain (lineage) typing, using a dataset of 535 complete genomes spanning the human- and animal-adapted lineages. Illumina paired-end reads were simulated from each complete genome, assembled, and analysed using three reference-free, k-mer-based tools: MASH, PopPUNK, and SKA2 (Split K-mer Analysis). Genetic distances were generated with each method and compared with a ground-truth lineage assignment from TB-Profiler. Our results demonstrated that reference-free methods can effectively distinguish Mtb lineages, with SKA2 showing the most promising performance across all datasets. SKA2 consistently recovered lineage and sub-lineage structure with high accuracy, demonstrating strong potential as an alternative to traditional WGS workflows. These findings highlight the utility of reference-free methods, particularly SKA2, for enabling accessible, scalable, and rapid Mtb strain typing, while supporting genomic epidemiology with low computational resources.
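To illustrate what a reference-free, k-mer-based genetic distance looks like, the following minimal Python sketch computes a Mash-style distance between two sequences. It is not the authors' pipeline and does not call MASH, PopPUNK, or SKA2. It assumes exact k-mer sets rather than the MinHash sketches Mash uses to approximate the Jaccard index, and the function names, toy sequences, and choice of k are hypothetical.

```python
import math

# Illustrative only: exact k-mer sets on toy sequences. Mash itself estimates
# the Jaccard index from MinHash sketches rather than from full k-mer sets.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Collect canonical k-mers (the lexicographically smaller of each k-mer
    and its reverse complement), so strand orientation does not matter."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def mash_distance(j, k):
    """Mash-style distance D = -(1/k) * ln(2j / (1 + j)), capped to [0, 1]."""
    if j <= 0.0:
        return 1.0
    if j >= 1.0:
        return 0.0
    return min(1.0, -math.log(2.0 * j / (1.0 + j)) / k)


def kmer_distance(seq_a, seq_b, k):
    """Distance between two sequences based on shared canonical k-mers."""
    j = jaccard(canonical_kmers(seq_a, k), canonical_kmers(seq_b, k))
    return mash_distance(j, k)


if __name__ == "__main__":
    # Hypothetical toy sequences differing by a single substitution.
    seq_a = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"
    seq_b = "ACGTTGCAAGGCTTATGATCGGATCCTAGA"
    print(kmer_distance(seq_a, seq_b, k=5))
```

A pairwise matrix of such distances over all genomes is the kind of input that can then be clustered and compared against known lineage labels. Real analyses would use much larger k (Mash defaults to k = 21) and whole assemblies or read sets.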

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Microbial Genomics | 204 papers in training set | Top 0.1% | 37.5%
2. Scientific Reports | 3102 papers in training set | Top 11% | 8.2%
3. Bioinformatics | 1061 papers in training set | Top 5% | 3.6%
4. Genome Medicine | 154 papers in training set | Top 3% | 3.1%
50% of probability mass above
5. Nature Communications | 4913 papers in training set | Top 42% | 3.1%
6. PLOS Computational Biology | 1633 papers in training set | Top 12% | 2.7%
7. BMC Bioinformatics | 383 papers in training set | Top 3% | 2.7%
8. Briefings in Bioinformatics | 326 papers in training set | Top 3% | 2.6%
9. PLOS ONE | 4510 papers in training set | Top 45% | 2.6%
10. Frontiers in Microbiology | 375 papers in training set | Top 4% | 2.1%
11. mSystems | 361 papers in training set | Top 4% | 1.9%
12. Communications Biology | 886 papers in training set | Top 7% | 1.8%
13. Tuberculosis | 11 papers in training set | Top 0.1% | 1.7%
14. Microbiology Spectrum | 435 papers in training set | Top 3% | 1.5%
15. Advanced Science | 249 papers in training set | Top 13% | 1.3%
16. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 6% | 1.2%
17. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 4% | 1.2%
18. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.2%
19. Frontiers in Cellular and Infection Microbiology | 98 papers in training set | Top 4% | 1.2%
20. Nucleic Acids Research | 1128 papers in training set | Top 14% | 1.2%
21. eLife | 5422 papers in training set | Top 52% | 0.9%
22. GigaScience | 172 papers in training set | Top 3% | 0.8%
23. mSphere | 281 papers in training set | Top 6% | 0.7%
24. Journal of Clinical Microbiology | 120 papers in training set | Top 2% | 0.7%
25. PeerJ | 261 papers in training set | Top 16% | 0.7%
26. BMC Genomics | 328 papers in training set | Top 7% | 0.6%