Reference-free clustering as an epidemiological tool for Mycobacterium tuberculosis lineage typing
Chilengue, A. F.; Whiley, D. J.; Cox, K.; Domingo-Sananes, M. R.; Meehan, C. J.
Whole-genome sequencing (WGS) of Mycobacterium tuberculosis (Mtb) is widely used to investigate recent transmission events because it provides high-resolution strain typing. Accurate and rapid strain typing is essential for informing outbreak investigations and guiding tuberculosis control strategies. However, the gold-standard reference-guided SNP-calling pipeline currently used for strain typing relies on computationally intensive reference mapping, which makes it difficult to run in many high-burden, resource-limited settings, where simplified and scalable genomic tools are urgently needed. To address these limitations, we explored reference-free methods for medium-resolution epidemiology, namely Mtb strain (lineage) typing, using a dataset of 535 complete genomes spanning the human- and animal-adapted lineages. Illumina paired-end reads were simulated from each complete genome, assembled, and analysed using three reference-free, k-mer-based tools: MASH, PopPUNK, and SKA2 (Split K-mer Analysis). Genetic distances were generated with each method and compared with ground-truth lineage assignments from TB Profiler. Our results demonstrate that reference-free methods can effectively distinguish Mtb lineages, with SKA2 showing the most promising performance across all datasets. SKA2 consistently recovered lineage and sub-lineage structure with high accuracy, demonstrating strong potential as an alternative to traditional WGS workflows. These findings highlight the utility of reference-free methods, particularly SKA2, for accessible, scalable, and rapid Mtb strain typing that supports genomic epidemiology with low computational resources.
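To illustrate what a reference-free, k-mer-based genetic distance looks like, the sketch below computes a Mash-style distance between two sequences. It is not the authors' pipeline and it does not call MASH, PopPUNK, or SKA2. It assumes an exact Jaccard index over full k-mer sets (the MASH tool estimates this from MinHash sketches and uses canonical k-mers), and the function names, toy sequences, and value of k are hypothetical.

```python
import math


def kmers(seq, k):
    """Return the set of all substrings of length k in seq (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_style_distance(j, k):
    """Mash distance formula D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1].

    Assumption: j is an exact Jaccard index, not a MinHash estimate.
    """
    if j <= 0.0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2.0 * j / (1.0 + j)) / k))


# Hypothetical toy sequences; real genomes would use a much larger k.
K = 2
set_a = kmers("ACGT", K)
set_b = kmers("ACGA", K)
distance = mash_style_distance(jaccard(set_a, set_b), K)
print(f"Jaccard = {jaccard(set_a, set_b):.2f}, distance = {distance:.4f}")
```

In a lineage-typing setting, pairwise distances of this kind would be computed for every genome pair and then checked against the lineage labels, for example by testing whether each genome's nearest neighbour carries the same label.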