
A comparison of scalable approaches for the pairwise analysis of large pathogen genomic and spatial datasets: an application to studying Mycobacterium tuberculosis transmission

Lan, Y.; Wu, C.-Y.; Lin, H.-H.; Cohen, T.; Warren, J. L.

2026-05-21 microbiology
10.64898/2026.05.21.726848 bioRxiv

Pairwise analysis of genomic and spatial data offers opportunities to identify and estimate the associations between covariates and the transmission of pathogens between individuals. However, such pairwise analyses are computationally intensive and may not be feasible given the large number of dyads in even moderately sized datasets. Here we compare two approaches to increase the efficiency of pairwise analysis for large datasets. We quantify and compare the performance of divide-and-conquer Bayesian model fitting and pairwise case-control approaches for estimating associations between individual- and pair-level covariates and shared membership in a transmission cluster. We utilize a large dataset (n=4,154) of spatially-referenced, genomically-sequenced Mycobacterium tuberculosis isolates collected from a single city for this analysis. We find that the case-control approach produces unbiased estimates of effect sizes with expected credible interval coverage and is more robust than the divide-and-conquer method when effect sizes are large. Thus, we recommend using the case-control approach with at least three controls per case to downscale datasets for pairwise analysis when analysis of the entire dataset is not possible. This approach mitigates the computational challenges of pairwise Bayesian modeling on datasets that require significant computational resources while maintaining desired inferential properties.

Author Summary

Pairwise analyses of large datasets to study pathogen transmission are computationally demanding because they typically require simultaneous analysis of each possible pair of individuals in a dataset; as datasets become larger, these analyses are often not feasible to conduct even with access to high-performance computing resources. In this work, we compare a case-control approach and divide-and-conquer approaches for more efficient pairwise analysis of large datasets. Using a large dataset of Mycobacterium tuberculosis isolates including genetic and spatial data, we investigate the performance of each method for estimating the associations between host covariates and genetic clustering of isolates. We find that the case-control approach is generally preferred over methods that first divide the data into subsets and then combine results. While additional extensions of these analyses are needed to test the generality of these findings to other data settings, this work provides a practical way forward for the pairwise analysis of large datasets to study pathogen transmission.
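
To make the scale of the problem concrete, the sketch below counts the pairs implied by n=4,154 isolates and shows one simple way pairwise case-control downsampling could work. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_case_control_pairs`, the `cluster_of` mapping, and the use of uniform random sampling of controls are all hypothetical choices made here. A "case" is taken to be a pair of isolates in the same transmission cluster, and a "control" is a pair that is not.

```python
import itertools
import random
from math import comb

# Number of unordered pairs among the 4,154 isolates in the abstract.
total_pairs = comb(4154, 2)  # 8,625,781 possible pairs


def sample_case_control_pairs(cluster_of, controls_per_case=3, seed=0):
    """Downsample pairs to all cases plus a fixed number of controls per case.

    cluster_of: dict mapping isolate id -> cluster label (None = unclustered).
    This mapping and the uniform sampling scheme are assumptions of the sketch.
    """
    rng = random.Random(seed)
    cases, non_cases = [], []
    for a, b in itertools.combinations(sorted(cluster_of), 2):
        # A pair is a case only if both isolates carry the same non-null label.
        same = cluster_of[a] is not None and cluster_of[a] == cluster_of[b]
        (cases if same else non_cases).append((a, b))
    # Cap the request at the number of non-case pairs that actually exist.
    n_controls = min(len(non_cases), controls_per_case * len(cases))
    controls = rng.sample(non_cases, n_controls)
    return cases, controls


# Toy example: 10 isolates, two clusters of size 2, the rest unclustered.
toy = {i: None for i in range(10)}
toy.update({0: "c1", 1: "c1", 2: "c2", 3: "c2"})
cases, controls = sample_case_control_pairs(toy, controls_per_case=3)
```

With three controls per case, the toy example keeps 2 case pairs and 6 control pairs out of 45 possible pairs. Enumerating every pair, as done here for clarity, would itself become costly on larger datasets, where controls would more likely be drawn without listing all non-case pairs first.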

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology; 1633 papers in training set; Top 0.9%; 19.8%
2. Microbial Genomics; 204 papers in training set; Top 0.2%; 10.3%
3. PLOS ONE; 4510 papers in training set; Top 23%; 7.3%
4. PLOS Genetics; 756 papers in training set; Top 4%; 3.7%
5. Scientific Reports; 3102 papers in training set; Top 34%; 3.7%
6. PeerJ; 261 papers in training set; Top 2%; 3.7%
7. Bioinformatics; 1061 papers in training set; Top 5%; 3.7%

50% of probability mass above

8. G3 Genes|Genomes|Genetics; 351 papers in training set; Top 0.8%; 2.9%
9. Statistics in Medicine; 34 papers in training set; Top 0.1%; 2.8%
10. Genetics; 225 papers in training set; Top 2%; 2.8%
11. Microbiology; 57 papers in training set; Top 0.4%; 2.1%
12. G3; 33 papers in training set; Top 0.2%; 1.7%
13. Phytopathology®; 28 papers in training set; Top 0.3%; 1.7%
14. Journal of The Royal Society Interface; 189 papers in training set; Top 3%; 1.5%
15. Journal of Theoretical Biology; 144 papers in training set; Top 1%; 1.4%
16. mBio; 750 papers in training set; Top 9%; 1.2%
17. mSystems; 361 papers in training set; Top 6%; 1.1%
18. Bulletin of Mathematical Biology; 84 papers in training set; Top 2%; 0.9%
19. Methods in Ecology and Evolution; 160 papers in training set; Top 2%; 0.9%
20. BMC Research Notes; 29 papers in training set; Top 0.3%; 0.9%
21. BMC Genomics; 328 papers in training set; Top 5%; 0.8%
22. Malaria Journal; 48 papers in training set; Top 1%; 0.8%
23. Epidemics; 104 papers in training set; Top 2%; 0.8%
24. GigaScience; 172 papers in training set; Top 3%; 0.7%
25. F1000Research; 79 papers in training set; Top 5%; 0.7%
26. Journal of Microbiological Methods; 11 papers in training set; Top 0.5%; 0.7%
27. PLOS Neglected Tropical Diseases; 378 papers in training set; Top 6%; 0.7%
28. iScience; 1063 papers in training set; Top 36%; 0.7%
29. Microbiology Spectrum; 435 papers in training set; Top 6%; 0.7%
30. eLife; 5422 papers in training set; Top 61%; 0.7%