Exploring the impact of read clustering thresholds on RADseq-based systematics: an empirical example from European amphibians.

Rancilhac, L.; Sylvestre, F.; Hutter, C. R.; Arntzen, J. W.; Babik, W.; Crochet, P.-A.; Deso, G.; Duguet, R.; Galan, P.; Pabijan, M.; Policain, M.; Priol, P.; Sabino-Pinto, J.; Capstick, M.; Elmer, K. R.; Dufresnes, C.; Vences, M.

2023-04-20 evolutionary biology
10.1101/2023.04.19.537466 bioRxiv
Restriction site-Associated DNA sequencing (RADseq) has great potential for genome-wide systematics studies of non-model organisms. However, accurately assembling RADseq reads into orthologous loci remains a major challenge in the absence of a reference genome. Traditional assembly pipelines cluster putative orthologous sequences based on a user-defined clustering threshold. Because improper clustering of orthologs is expected to affect results in downstream analyses, it is crucial to design pipelines for empirically optimizing the clustering threshold. While this issue has been widely discussed from a population genomics perspective, it remains understudied in the context of phylogenomics and coalescent species delimitation. To address this issue, we generated RADseq assemblies of representatives of the amphibian genera Discoglossus, Rana, Lissotriton and Triturus using a wide range of clustering thresholds. In particular, we studied the effects of the intra-sample Clustering Threshold (iCT) and between-sample Clustering Threshold (bCT) separately, as the optimal values of the two are expected to differ in multi-species data sets. The resulting assemblies were used for downstream inference of concatenation-based phylogenies, multi-species coalescent species trees, and species delimitation. The results were evaluated in light of a reference genome-wide phylogeny calculated from newly generated Hybrid-Enrichment markers, as well as extensive background knowledge on the species' systematics. Overall, our analyses show that the inferred topologies and their resolution are resilient to changes in the iCT and bCT, regardless of the analytical method employed. Except for some extreme clustering thresholds, all assemblies yielded identical, well-supported inter-species relationships that were mostly congruent with those inferred from the reference Hybrid-Enrichment data set. Similarly, coalescent species delimitation was consistent across clustering threshold values.
However, we identified a strong effect of the bCT on the branch lengths of concatenation and species trees, with higher bCTs yielding trees with shorter branches, which might be a pitfall for downstream inferences of evolutionary rates. Our results suggest that the choice of assembly parameters for RADseq data in the context of shallow phylogenomics might be less challenging than previously thought. Finally, we propose a pipeline for empirical optimization of the iCT and bCT, implemented in optiRADCT, a series of scripts readily usable for future RADseq studies.

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Molecular Ecology Resources | 161 | Top 0.1% | 59.6%
(50% of probability mass above)
2 | Methods in Ecology and Evolution | 160 | Top 0.8% | 4.0%
3 | Molecular Phylogenetics and Evolution | 61 | Top 0.1% | 2.6%
4 | Systematic Biology | 121 | Top 0.2% | 2.1%
5 | Molecular Ecology | 304 | Top 2% | 1.9%
6 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.8%
7 | PeerJ | 261 | Top 7% | 1.7%
8 | Molecular Biology and Evolution | 488 | Top 3% | 1.7%
9 | Peer Community Journal | 254 | Top 2% | 1.7%
10 | PLOS ONE | 4510 | Top 57% | 1.5%
11 | BMC Bioinformatics | 383 | Top 5% | 1.3%
12 | Journal of Heredity | 35 | Top 0.1% | 1.3%
13 | PLOS Computational Biology | 1633 | Top 18% | 1.3%
14 | BMC Genomics | 328 | Top 4% | 1.2%
15 | G3: Genes, Genomes, Genetics | 222 | Top 0.6% | 1.2%
16 | Genome Biology and Evolution | 280 | Top 2% | 1.0%
17 | BMC Ecology and Evolution | 49 | Top 1% | 1.0%
18 | Philosophical Transactions of the Royal Society B | 51 | Top 5% | 0.9%
19 | Scientific Reports | 3102 | Top 73% | 0.8%
20 | eLife | 5422 | Top 57% | 0.8%
21 | Applications in Plant Sciences | 21 | Top 0.3% | 0.7%
22 | GigaScience | 172 | Top 3% | 0.7%
23 | Genome Research | 409 | Top 5% | 0.5%