Improved inference of multiscale sequence statistics in generative protein models

Chauveau, M.; Kleeorin, Y.; Hinds, E.; Junier, I.; Ranganathan, R.; Rivoire, O.

2026-04-09 systems biology
10.64898/2026.04.06.716859 bioRxiv

High dimensionality and multiscale statistical structure are pervasive features of biological data, posing fundamental challenges for modeling. Because model inference generally proceeds with far fewer data than parameters, statistical patterns across scales are often unevenly represented. Protein sequences provide a paradigmatic example: statistics across homologs are inherently multiscale, displaying collective correlations among conserved residue sectors that encode function, alongside localized correlations corresponding to physical contacts outside these sectors. Standard regularization strategies used to mitigate undersampling during model inference have been shown to capture these patterns unevenly, a bias that compromises generative models of protein sequences by limiting their ability to produce both functional and diverse proteins. This limitation is exemplified by Boltzmann machine-based generative models, which so far have required post hoc corrections to recover functionality, at the cost of reduced sequence diversity. Here, we introduce the stochastic Boltzmann Machine (sBM), a new regularization strategy that more accurately captures different correlation scales. Through analyses of theoretical models with known ground-truth parameters and experiments on the chorismate mutase family, we show that sBM effectively mitigates distortions in the estimation of model parameters, enabling the generation of functional sequences with greater diversity and without the need for post hoc corrections. These results advance the inference of generative models that more faithfully reflect the evolutionary constraints shaping protein sequences.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Cell Systems: 167 papers in training set, Top 0.3%, 22.0%
2. PLOS Computational Biology: 1633 papers in training set, Top 2%, 14.0%
3. Nature Communications: 4913 papers in training set, Top 25%, 7.0%
4. Nature Methods: 336 papers in training set, Top 2%, 6.2%
5. Nature Computational Science: 50 papers in training set, Top 0.1%, 4.7%

50% of probability mass above

6. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 15%, 4.7%
7. Bioinformatics: 1061 papers in training set, Top 5%, 4.7%
8. Nucleic Acids Research: 1128 papers in training set, Top 6%, 3.5%
9. Communications Biology: 886 papers in training set, Top 6%, 2.0%
10. eLife: 5422 papers in training set, Top 37%, 2.0%
11. Nature Machine Intelligence: 61 papers in training set, Top 2%, 1.8%
12. BMC Bioinformatics: 383 papers in training set, Top 4%, 1.8%
13. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.7%
14. Scientific Reports: 3102 papers in training set, Top 60%, 1.7%
15. Journal of The Royal Society Interface: 189 papers in training set, Top 3%, 1.3%
16. Nature Biotechnology: 147 papers in training set, Top 5%, 1.3%
17. Cell Reports: 1338 papers in training set, Top 28%, 1.2%
18. Genome Biology: 555 papers in training set, Top 6%, 1.2%
19. Acta Crystallographica Section D Structural Biology: 54 papers in training set, Top 0.4%, 0.7%
20. npj Systems Biology and Applications: 99 papers in training set, Top 3%, 0.7%
21. Frontiers in Molecular Biosciences: 100 papers in training set, Top 5%, 0.7%
22. PLOS ONE: 4510 papers in training set, Top 70%, 0.7%
23. Nature: 575 papers in training set, Top 16%, 0.7%
24. Biophysical Journal: 545 papers in training set, Top 6%, 0.6%
25. Briefings in Bioinformatics: 326 papers in training set, Top 8%, 0.6%
26. Cell Reports Methods: 141 papers in training set, Top 6%, 0.6%
27. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 11%, 0.6%
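
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below assumes that the final percentage on each entry is the predicted probability and that "account for 50%" means "the cumulative sum first reaches at least 50%". The helper name `journals_to_reach` is hypothetical, and only the first six entries are copied from the list.

```python
# (journal, predicted probability in percent), copied from ranks 1-6 of the list
predictions = [
    ("Cell Systems", 22.0),
    ("PLOS Computational Biology", 14.0),
    ("Nature Communications", 7.0),
    ("Nature Methods", 6.2),
    ("Nature Computational Science", 4.7),
    ("Proceedings of the National Academy of Sciences", 4.7),
]

def journals_to_reach(predictions, threshold):
    """Return (rank, cumulative %) at which the running sum first reaches threshold."""
    total = 0.0
    for rank, (_, pct) in enumerate(predictions, start=1):
        total += pct
        if total >= threshold:
            return rank, total
    return None  # threshold never reached with the entries given

rank, mass = journals_to_reach(predictions, 50.0)
print(rank, round(mass, 1))  # prints: 5 53.9
```

The first four journals sum to 49.2%, just under the threshold, so the fifth is needed and brings the cumulative mass to 53.9%.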