
Verticall: A fast and robust tool for recombination detection in large-scale bacterial genomic datasets

Odih, E. E.; Wick, R. R.; Holt, K. E.

2026-04-24 bioinformatics
10.64898/2026.04.21.719734 bioRxiv

The inference and removal of horizontally acquired genomic regions is a crucial step in phylogenomic analyses for evolutionary studies. Existing tools perform well on clonal lineage-focused datasets on the scale of hundreds of genomes, but are limited in their ability to analyse larger or more diverse datasets. Here we present Verticall, a tool to identify recombinant regions in bacterial assemblies and generate recombination-free phylogenies, which scales to thousands of genomes from clonal to genus-level diversity. Verticall uses a non-parametric approach to assign genomic regions as horizontally or vertically related based on the distribution of pairwise genetic distances between genomes. Recombination-free phylogenetic trees may be inferred either by calculating a pairwise genetic distance matrix from vertical-only regions (distance-tree approach) or by comparing every genome pairwise to a reference and then masking horizontally acquired regions in a pseudo-alignment to that reference (alignment-tree approach). We demonstrate Verticall's performance using four publicly available whole-genome sequence datasets of varying sample sizes (range: 154 to 4,857 genomes) and evolutionary scales (ranging from within-lineage to genus-wide diversity). Across all four datasets, Verticall showed comparable or superior performance to the established tools Gubbins and ClonalFrameML in terms of computational efficiency, plausibility of inferred phylogenetic trees, and recovery of temporal signal for molecular dating. Our results show that Verticall is a useful tool for detecting recombination more efficiently and accurately, particularly in datasets for which existing tools are limited, including large datasets with hundreds to thousands of genomes and those that span entire species or genera. Verticall is available free and open source at https://github.com/rrwick/Verticall.

Impact Statement

Many bacterial species can acquire genetic material from external sources and stably incorporate it into their own genomes through homologous recombination. During phylogenomic analyses to investigate outbreaks or for evolutionary studies, a core objective is often to reconstruct the evolutionary history of the studied organisms independent of these horizontally acquired genomic regions. This is particularly desirable when the aim is to construct dated phylogenies, as horizontally acquired variation can interfere with the molecular clock signal on which dating relies. Existing recombination detection programs perform well in certain contexts, but their algorithms are not suitable for datasets with very high diversity or thousands of genomes. We addressed this gap by developing the software package Verticall. We show that this approach produces results comparable to those of existing software for smaller, more clonal datasets, but also performs well on datasets that the existing packages cannot handle.

Data Summary

Verticall is available free and open source at https://github.com/rrwick/Verticall. We used published whole-genome sequence data deposited in public databases (Pathogenwatch [https://pathogen.watch/]; European Nucleotide Archive [https://www.ebi.ac.uk/ena/]; Sequence Read Archive [https://www.ncbi.nlm.nih.gov/sra/]). Accession numbers for the raw whole-genome sequences are presented in Tables S2-S6. All data, code, and analysis commands used to generate the results and figures presented in this paper are available on figshare (DOI: 10.6084/m9.figshare.31930821) and GitHub (https://github.com/erkison/verticall_paper).
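
The masking step of the alignment-tree approach described in the abstract can be illustrated with a minimal sketch. This is not Verticall's implementation: the function name `mask_regions`, the 0-based half-open coordinate convention, and the use of `N` as the mask character are all assumptions made for illustration. The sketch shows only the general idea of replacing positions flagged as horizontally acquired in one aligned sequence so that they do not contribute to tree inference.

```python
from typing import Iterable, Tuple


def mask_regions(
    sequence: str,
    regions: Iterable[Tuple[int, int]],
    mask_char: str = "N",
) -> str:
    """Return a copy of `sequence` with each region replaced by `mask_char`.

    Hypothetical helper, not part of Verticall. Regions are assumed to be
    0-based, half-open (start, end) intervals on the aligned sequence.
    """
    chars = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range regions are harmless.
        start = max(start, 0)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    aligned = "ACGTACGTAC"       # one sequence from a pseudo-alignment
    horizontal = [(2, 5)]        # positions flagged as horizontally acquired
    print(mask_regions(aligned, horizontal))
```

In a real pipeline, a helper of this kind would be applied to every sequence in the pseudo-alignment using that sequence's own flagged regions; how those regions are identified is the part that Verticall itself provides.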

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 0.3% | 44.1%
2. Molecular Biology and Evolution | 488 papers in training set | Top 0.3% | 10.7%
(50% of probability mass above)
3. Methods in Ecology and Evolution | 160 papers in training set | Top 0.7% | 4.4%
4. PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.8%
5. BMC Bioinformatics | 383 papers in training set | Top 3% | 3.3%
6. Genome Research | 409 papers in training set | Top 1% | 2.9%
7. Genome Biology | 555 papers in training set | Top 3% | 2.2%
8. Bioinformatics Advances | 184 papers in training set | Top 2% | 1.9%
9. Virus Evolution | 140 papers in training set | Top 0.7% | 1.8%
10. Nucleic Acids Research | 1128 papers in training set | Top 10% | 1.8%
11. Microbial Genomics | 204 papers in training set | Top 1% | 1.8%
12. Systematic Biology | 121 papers in training set | Top 0.3% | 1.4%
13. PLOS ONE | 4510 papers in training set | Top 57% | 1.4%
14. Journal of Open Source Software | 22 papers in training set | Top 0.1% | 1.3%
15. Nature Communications | 4913 papers in training set | Top 56% | 1.3%
16. PLOS Genetics | 756 papers in training set | Top 12% | 1.0%
17. Nature Biotechnology | 147 papers in training set | Top 6% | 0.9%
18. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 40% | 0.9%
19. mSphere | 281 papers in training set | Top 5% | 0.8%
20. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.8%
21. PeerJ | 261 papers in training set | Top 14% | 0.8%
22. Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.8%
23. Nature Methods | 336 papers in training set | Top 7% | 0.7%
24. Molecular Ecology Resources | 161 papers in training set | Top 1% | 0.5%
25. Microbiome | 139 papers in training set | Top 4% | 0.5%
26. Peer Community Journal | 254 papers in training set | Top 5% | 0.5%