Summary statistics and approximate Bayesian computation are comparable to convolutional neural networks for inferring times to fixation
Roberts, M.; Josephs, E. B.
Detecting signatures of positive selection in genomes is a common application of population genetics, and one of the most influential models for this task is the hard selective sweep, in which a de novo mutation rapidly fixes. Many statistics have been developed to detect hard sweeps, often attempting to summarize signatures left behind in the site frequency spectrum, linkage disequilibrium, and haplotype frequency. However, undiscovered signals could still exist. We tested whether any undiscovered signatures of hard sweeps exist by comparing machine learning models, which can learn signatures from raw data without any prior knowledge, to known summary statistics for inferring the time to fixation (tf) of a hard sweep in a background of variable sweep ages (ta). Across approximately 200,000 simulations encompassing 5 different demographic scenarios of single panmictic populations, machine learning models trained directly on raw genotype data failed to predict tf better than methods based purely on common summary statistics. This suggests that few undiscovered signals remain in single-timepoint, single-population genotype data that can better disentangle tf and ta of hard sweeps.
Matching journals
The top 5 journals account for 50% of the predicted probability mass.