Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs)

Georges, A.; Mijangos, L.; Patel, H. R.; Aitkens, M.; Gruber, B. R.

2023-05-11 bioinformatics

10.1101/2023.03.22.533737 bioRxiv


Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands if not hundreds of thousands of loci.

We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real datasets and simulated data to illustrate pitfalls in the application of genetic distances and their visualization.

Euclidean Distance is the metric of choice in many distance studies. However, other measures may be preferable because of their underlying models of divergence, population demographic history and linkage disequilibrium, because it is desirable to down-weight joint absences, or because of other characteristics specific to the data or analyses. Distance measures for SNP genotype data that depend on the arbitrary choice of reference and alternate alleles (e.g. Bray-Curtis distance) should not be used. Careful consideration should be given to which state is scored zero when applying binary distance measures to sequence tag presence-absences (e.g. Jaccard distance).

Missing values that arise in the SNP discovery process can cause displacement of affected individuals from their natural groupings and artificial inflation of confidence envelopes, leading to potential misinterpretation. Filtering on missing values then imputing those that remain avoids distortion in visual representations. Failure of a distance measure to conform to metric and Euclidean properties is important but only likely to create unacceptable outcomes in extreme cases. Lack of randomness in the selection of individuals (e.g. inclusion of sibs) and lack of independence of both individuals and loci (e.g. polymorphic haploblocks), can lead to substantial and otherwise inexplicable distortions of the visual representations and again, potential misinterpretation.
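
The abstract's point about the arbitrary choice of reference and alternate alleles, and about which state is scored zero, can be illustrated with a minimal Python sketch. It is not the authors' code: the toy genotypes, the 0/1/2 coding (count of the alternate allele) and the helper names `swap_alleles`, `bray_curtis` and `jaccard_distance` are assumptions made purely for illustration. Swapping the allele labels at a locus recodes each genotype g as 2 - g; a distance that changes under that recoding depends on an arbitrary labelling.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def bray_curtis(x, y):
    """Bray-Curtis distance: sum|a - b| / sum(a + b)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

def jaccard_distance(x, y):
    """Jaccard distance on binary scores; joint zeros are ignored."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return 1 - both / either

def swap_alleles(genotype, loci):
    """Hypothetical helper: relabel reference/alternate alleles at the
    given locus indices, so a score g (0, 1 or 2) becomes 2 - g."""
    return [2 - g if i in loci else g for i, g in enumerate(genotype)]

# Toy SNP genotypes for two individuals (assumed data, 5 loci).
ind_a = [0, 1, 2, 0, 1]
ind_b = [0, 2, 2, 1, 0]

# Relabel the alleles at loci 0 and 3 in both individuals.
flipped = {0, 3}
ind_a_swapped = swap_alleles(ind_a, flipped)
ind_b_swapped = swap_alleles(ind_b, flipped)

# Euclidean distance is unchanged by the relabelling ...
print(euclidean(ind_a, ind_b), euclidean(ind_a_swapped, ind_b_swapped))
# ... whereas Bray-Curtis distance is not.
print(bray_curtis(ind_a, ind_b), bray_curtis(ind_a_swapped, ind_b_swapped))

# Toy presence-absence scores for sequence tags (assumed data, 6 tags).
tags_a = [1, 1, 0, 0, 0, 0]
tags_b = [1, 0, 1, 0, 0, 0]

# Scoring absence as 1 instead of 0 changes which joint state is ignored.
inv_a = [1 - t for t in tags_a]
inv_b = [1 - t for t in tags_b]
print(jaccard_distance(tags_a, tags_b), jaccard_distance(inv_a, inv_b))
```

In this toy case the per-locus differences are identical before and after relabelling, so the Euclidean distance is preserved, while the Bray-Curtis denominator (the sum of all scores) changes with the labelling. The Jaccard comparison shows the analogous effect for binary data: the state scored zero is the one whose joint occurrences are down-weighted.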
