
Machine learning illuminates how diet influences the evolution of yeast galactose metabolism

Harrison, M.-C.; Ubbelohde, E. J.; LaBella, A. L.; Opulente, D. A.; Wolters, J. F.; Zhou, X.; Shen, X.-X.; Groenewald, M.; Hittinger, C. T.; Rokas, A.

2023-07-23 evolutionary biology
10.1101/2023.07.20.549758 bioRxiv

How genomic differences contribute to phenotypic differences across species is a major question in biology. The recently characterized genomes, isolation environments, and qualitative patterns of growth on 122 sources and conditions of 1,154 strains from 1,049 fungal species (nearly all known) in the subphylum Saccharomycotina provide a powerful, yet complex, dataset for addressing this question. In recent years, machine learning has been successfully used in diverse analyses of biological big data. Using a random forest classification algorithm trained on these genomic, metabolic, and/or environmental data, we predicted growth on several carbon sources and conditions with high accuracy from presence/absence patterns of genes and of growth in other conditions. Known structural genes involved in assimilation of these sources were important features contributing to prediction accuracy, whereas isolation environmental data were poor predictors. By further examining growth on galactose, we found that it can be predicted with high accuracy from either genomic (92.6%) or growth data in 120 other conditions (83.3%) but not from isolation environment data (65.7%). When we combined genomic and growth data, we noted that prediction accuracy was even higher (93.4%) and that, after the GALactose utilization genes, the most important feature for predicting growth on galactose was growth on galactitol. These data raised the hypothesis that several species in two orders, Serinales and Pichiales (containing Candida auris and the genus Ogataea, respectively), have an alternative galactose utilization pathway because they lack the GAL genes. Growth and biochemical assays of several of these species confirmed that they utilize galactose through an oxidoreductive D-galactose pathway, rather than the canonical GAL pathway. 
We conclude that machine learning is a powerful tool for investigating the evolution of the yeast genotype-phenotype map and that it can help uncover novel biology, even in well-studied traits.
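The kind of analysis the abstract describes, a random forest classifier predicting growth from gene presence/absence patterns, can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are available and uses purely synthetic data. The matrix size, the three-column "gene cluster", the 5% label noise, and all variable names are hypothetical choices for illustration, not the authors' data, features, or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_strains, n_genes = 300, 50

# Synthetic presence/absence matrix (1 = gene present); NOT data from the study.
X = rng.integers(0, 2, size=(n_strains, n_genes))

# Hypothetical assumption: columns 0-2 stand in for a gene cluster that is
# gained or lost as a unit, and growth on the carbon source follows it.
has_cluster = rng.integers(0, 2, size=n_strains)
X[:, :3] = has_cluster[:, None]

# Flip about 5% of the growth labels to mimic strains that break the rule.
flip = rng.random(n_strains) < 0.05
y = np.where(flip, 1 - has_cluster, has_cluster)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out accuracy and the three most important features (column indices).
accuracy = model.score(X_test, y_test)
top_features = np.argsort(model.feature_importances_)[::-1][:3]
print(f"accuracy={accuracy:.3f}, top features={sorted(top_features.tolist())}")
```

In this toy setting the most important features are the cluster columns by construction. Strains that break the rule, such as growth without the cluster, are the cases that would prompt a search for an alternative pathway, as the abstract reports for galactose.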

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology (1633 papers in training set; Top 2%): 14.5%
2. Molecular Biology and Evolution (488 papers in training set; Top 0.5%): 8.3%
3. eLife (5422 papers in training set; Top 14%): 6.2%
4. Genetics (225 papers in training set; Top 1%): 4.1%
5. mBio (750 papers in training set; Top 4%): 4.1%
6. BMC Ecology and Evolution (49 papers in training set; Top 0.3%): 4.1%
7. Frontiers in Fungal Biology (10 papers in training set; Top 0.1%): 3.6%
8. PLOS Biology (408 papers in training set; Top 3%): 3.5%
9. G3 (33 papers in training set; Top 0.1%): 3.2%

50% of probability mass above

10. Current Biology (596 papers in training set; Top 7%): 2.8%
11. Yeast (15 papers in training set; Top 0.1%): 2.8%
12. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 24%): 2.7%
13. BMC Biology (248 papers in training set; Top 0.5%): 2.6%
14. mSystems (361 papers in training set; Top 4%): 2.4%
15. Molecular Ecology (304 papers in training set; Top 2%): 2.1%
16. Genome Biology and Evolution (280 papers in training set; Top 0.8%): 2.1%
17. NAR Genomics and Bioinformatics (214 papers in training set; Top 2%): 1.9%
18. BMC Genomics (328 papers in training set; Top 2%): 1.8%
19. GENETICS (189 papers in training set; Top 0.6%): 1.8%
20. Scientific Reports (3102 papers in training set; Top 59%): 1.7%
21. mSphere (281 papers in training set; Top 3%): 1.7%
22. G3 Genes|Genomes|Genetics (351 papers in training set; Top 2%): 1.5%
23. G3: Genes, Genomes, Genetics (222 papers in training set; Top 0.5%): 1.5%
24. PLOS Genetics (756 papers in training set; Top 10%): 1.5%
25. New Phytologist (309 papers in training set; Top 4%): 1.3%
26. Nature Communications (4913 papers in training set; Top 55%): 1.3%
27. Genome Research (409 papers in training set; Top 3%): 1.2%
28. Frontiers in Genetics (197 papers in training set; Top 8%): 0.9%
29. Frontiers in Microbiology (375 papers in training set; Top 8%): 0.9%
30. PLOS ONE (4510 papers in training set; Top 66%): 0.8%
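
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked directly from the listed percentages. The short sketch below does that arithmetic. The variable names are illustrative, and it assumes the final percentage in each entry is that journal's predicted probability.

```python
from itertools import accumulate

# Predicted probabilities from the list above, in rank order, stored as
# tenths of a percent (145 == 14.5%) so the running sums stay exact.
probs_tenths = [
    145, 83, 62, 41, 41, 41, 36, 35, 32, 28,
    28, 27, 26, 24, 21, 21, 19, 18, 18, 17,
    17, 15, 15, 15, 13, 13, 12, 9, 9, 8,
]

cumulative = list(accumulate(probs_tenths))

# Smallest number of top-ranked journals whose cumulative mass reaches 50%.
k = next(i for i, total in enumerate(cumulative, start=1) if total >= 500)

print(k, cumulative[k - 1] / 10)  # prints: 9 51.6
```

The first eight journals sum to 48.4%, so the ninth is the one that carries the cumulative total past the 50% mark.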