
Linking Codon- and Protein-Level Mutation Scores to Population Genetics Reveals Heterogeneous Selection Efficiency Across Escherichia coli Lineages

Mischler, M.; Vigue, L.; Croce, G.; Weigt, M.; Tenaillon, O.

2026-03-18 genetics
10.64898/2026.03.16.711857 bioRxiv

Quantifying the selective effects of individual mutations is essential to understand how their population-wise frequencies evolve under natural selection and genetic drift. Large genomic datasets provide a real-life experiment that we exploit to characterize the efficiency of selection across different mutation types and populations. Using Direct Coupling Analysis, a model from statistical physics, we derive protein-informed scores for individual non-synonymous mutations identified in 81,440 Escherichia coli genomes. We show that these scores act as a latent variable capturing the probability that a mutation is beneficial, neutral, or mildly to highly deleterious. We contribute to the debate on the importance of synonymous mutations by demonstrating that their selection intensities span a single order of magnitude in the E. coli species, whereas those of non-synonymous mutations span six orders of magnitude. We further relate selection efficiency to genetic drift, defined as the inverse of population size, and to ecological lifestyle, and we identify a 10,000-fold reduction in selection efficiency between the entire E. coli species and its most pathogenic populations. Together, these results highlight how population genetics and protein variant fitness predictors inform one another: variation in selection efficiency is associated with shifts in the distribution of mutation scores, and population genetics data provide a benchmark to assess the accuracy of these scores.

Graphical abstract

Schematic representation of the analysis of polymorphism in 81,440 Escherichia coli genomes. 458,443 polymorphic codon sites were identified and oriented using homologous sequences from closely related species. Mutations can be classified as synonymous or non-synonymous based on whether they alter the encoded amino-acid sequence, and real-valued scores predictive of fitness effects can be attributed to mutations within each of these classes. Codon scores reflect the global codon usage preference within the E. coli genome. DCA scores capture position- and amino-acid-specific preferences as well as epistatic constraints, and are obtained for each protein from a set of distantly related homologous sequences. Coupled with the abundance of polymorphic sites within different E. coli subpopulations, these polymorphism classifications allow a precise comparison of the intensity of selection between different types of mutations and across populations with distinct lifestyles, illustrated here by their pathogenic power.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Genetics: 225 papers in training set, Top 0.1%, 22.0%
2. Molecular Biology and Evolution: 488 papers in training set, Top 0.4%, 9.9%
3. Nucleic Acids Research: 1128 papers in training set, Top 3%, 6.7%
4. Genome Biology: 555 papers in training set, Top 1%, 6.2%
5. Molecular Systems Biology: 142 papers in training set, Top 0.1%, 6.2%
50% of probability mass above
6. PLOS Computational Biology: 1633 papers in training set, Top 6%, 6.2%
7. GENETICS: 189 papers in training set, Top 0.1%, 4.7%
8. Nature Communications: 4913 papers in training set, Top 35%, 4.2%
9. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 18%, 3.9%
10. PLOS Genetics: 756 papers in training set, Top 5%, 3.5%
11. Nature Genetics: 240 papers in training set, Top 3%, 3.0%
12. Cell Genomics: 162 papers in training set, Top 3%, 1.8%
13. Science Advances: 1098 papers in training set, Top 18%, 1.7%
14. eLife: 5422 papers in training set, Top 44%, 1.6%
15. Nature Ecology & Evolution: 113 papers in training set, Top 3%, 1.2%
16. G3 Genes|Genomes|Genetics: 351 papers in training set, Top 2%, 1.2%
17. The American Journal of Human Genetics: 206 papers in training set, Top 3%, 0.9%
18. Virus Evolution: 140 papers in training set, Top 1%, 0.9%
19. Cell Reports: 1338 papers in training set, Top 32%, 0.8%
20. Genome Medicine: 154 papers in training set, Top 8%, 0.8%
21. Genome Biology and Evolution: 280 papers in training set, Top 2%, 0.8%
22. Science: 429 papers in training set, Top 20%, 0.7%
23. Genome Research: 409 papers in training set, Top 4%, 0.7%
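
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked against the listed values. The sketch below is only an arithmetic illustration: it assumes the final percentage on each entry is that journal's predicted probability, and the function name `journals_to_reach` is hypothetical rather than part of any tool used to produce the list.

```python
# Predicted probabilities (percent), copied in rank order from the list above.
# Assumption: the last percentage of each entry is the predicted probability.
probabilities = [
    22.0, 9.9, 6.7, 6.2, 6.2, 6.2, 4.7, 4.2, 3.9, 3.5, 3.0, 1.8,
    1.7, 1.6, 1.2, 1.2, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7,
]


def journals_to_reach(probs, threshold):
    """Return (rank, cumulative mass) at the first rank whose running
    total reaches the threshold, or (None, total) if it is never reached."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total


rank, mass = journals_to_reach(probabilities, 50.0)
print(f"{rank} journals reach {mass:.1f}% of the probability mass")
```

Under that assumption, the running total first reaches 50% at the fifth journal, which agrees with the position of the "50% of probability mass above" marker.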