Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites

Large, A. L.; Holmes, I. H.

2026-05-19 bioinformatics
10.64898/2026.05.16.725674 bioRxiv
The TKF92 model of molecular evolution--a linear birth-death process for indels, with finite-state continuous-time Markov chain substitutions--is exchangeable in residue identity at every site: the generative process treats amino acids symmetrically, conditional on a single substitution rate matrix. To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction -- an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs-Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. 
We close several gaps along the way -- exact closed-form sufficient statistics for the linear birth-death-immigration component, the resolvable L'Hôpital limit at λ =, and a closed-form M-step for a recursive generalisation of TKF92 -- and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry ~0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.
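As a rough illustration of what it means to partition sites exchangeably with a Dirichlet process, the sketch below draws class labels from a Chinese restaurant process, the partition distribution a Dirichlet process induces. It is not the authors' implementation: the function name `crp_partition`, the concentration parameter `alpha`, and the `seed` argument are all assumptions made for this example, and the sketch covers neither the TKF92 indel process nor the Potts couplings.

```python
import random


def crp_partition(n_sites, alpha, seed=0):
    """Draw one class label per site from a Chinese restaurant process.

    Hypothetical helper for illustration only. `alpha` (> 0) is the
    concentration parameter: larger values tend to give more classes.
    """
    rng = random.Random(seed)
    labels = []   # labels[i] is the class assigned to site i
    counts = []   # counts[k] is the number of sites already in class k
    for _ in range(n_sites):
        # An existing class k is chosen with weight counts[k];
        # a brand-new class is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new class
        else:
            counts[k] += 1        # join an existing class
        labels.append(k)
    return labels


labels = crp_partition(n_sites=20, alpha=1.0, seed=1)
print(labels)
```

Sites that share a label would be the candidates for co-evolution in a model of this kind. The probability of any given partition under this scheme does not depend on the order in which sites are visited, which is the sense of exchangeability at stake.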

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Cell Systems | 167 | Top 0.3% | 19.4%
2 | Proceedings of the National Academy of Sciences | 2130 | Top 7% | 8.4%
3 | Nature Communications | 4913 | Top 24% | 8.2%
4 | Molecular Biology and Evolution | 488 | Top 0.6% | 6.8%
5 | Science | 429 | Top 6% | 4.9%
6 | Nature Biotechnology | 147 | Top 2% | 4.9%
(50% of probability mass above)
7 | Nature | 575 | Top 6% | 4.3%
8 | Nature Genetics | 240 | Top 2% | 3.9%
9 | PLOS Computational Biology | 1633 | Top 9% | 3.7%
10 | Nature Methods | 336 | Top 3% | 3.6%
11 | Genetics | 225 | Top 2% | 2.7%
12 | Genome Biology | 555 | Top 3% | 2.6%
13 | The American Journal of Human Genetics | 206 | Top 2% | 2.4%
14 | Nature Computational Science | 50 | Top 0.3% | 2.4%
15 | Genome Research | 409 | Top 2% | 2.1%
16 | eLife | 5422 | Top 42% | 1.7%
17 | Bioinformatics | 1061 | Top 7% | 1.7%
18 | Systematic Biology | 121 | Top 0.3% | 1.5%
19 | Nature Microbiology | 133 | Top 3% | 1.2%
20 | Nature Machine Intelligence | 61 | Top 3% | 0.8%
21 | Virus Evolution | 140 | Top 1% | 0.7%
22 | PLOS Genetics | 756 | Top 16% | 0.7%
23 | Science Advances | 1098 | Top 33% | 0.6%