Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites
Large, A. L.; Holmes, I. H.
The TKF92 model of molecular evolution--a linear birth-death process for indels, with finite-state continuous-time Markov chain substitutions--is exchangeable in residue identity at every site: the generative process treats amino acids symmetrically, conditional on a single substitution rate matrix. To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction -- an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs-Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. 
We close several gaps along the way -- exact closed-form sufficient statistics for the linear birth-death-immigration component, the resolvable L'Hôpital limit at λ = μ, and a closed-form M-step for a recursive generalisation of TKF92 -- and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry ~0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.
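As a rough illustration of the kind of removable singularity the abstract mentions, the sketch below evaluates the classic TKF91 quantity β(t) = (1 − e^{(λ−μ)t}) / (μ − λ e^{(λ−μ)t}), which takes the form 0/0 when the insertion rate λ equals the deletion rate μ. Applying L'Hôpital's rule in λ gives the finite limit t / (1 + μt). That the paper's limit concerns this particular quantity is an assumption, and the function name `tkf_beta` and its argument names are hypothetical, not taken from the authors' code.

```python
import math


def tkf_beta(lam: float, mu: float, t: float) -> float:
    """TKF91 beta(t) for insertion rate lam, deletion rate mu, time t.

    Illustrative sketch only; the name and signature are assumptions.
    """
    if lam == mu:
        # Removable singularity: the L'Hopital limit as lam -> mu.
        return t / (1.0 + mu * t)
    # expm1 avoids catastrophic cancellation when lam is close to mu.
    em1 = math.expm1((lam - mu) * t)
    # Algebraically equal to (1 - e^x) / (mu - lam * e^x), x = (lam - mu) * t.
    return em1 / ((lam - mu) + lam * em1)


if __name__ == "__main__":
    # Values approach t / (1 + mu * t) as lam approaches mu.
    for delta in (1e-2, 1e-4, 1e-6, 0.0):
        print(delta, tkf_beta(1.0 + delta, 1.0, 2.0))
```

Writing the general branch in terms of `expm1` keeps it continuous with the limiting branch, so no tolerance threshold is needed to decide when the two rates count as equal.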
Matching journals
The top 6 journals account for 50% of the predicted probability mass.