Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize

Ferebee, T. H.; Buckler, E.

2023-05-14 bioinformatics
10.1101/2023.05.11.540406 bioRxiv

Genomic selection and gene editing in crops could be enhanced by multi-species, mechanistic models that predict the effects of changes in gene regulation. Current expression abundance prediction models require extensive computational resources and hard-to-measure species-specific training data, and they often fail to incorporate data from multiple species. We hypothesize that gene expression prediction models that harness the regulatory network structure of Arabidopsis thaliana transcription factor-target gene interactions will improve on the present maize models. To this end, we collect 147 Oryza sativa and 99 Sorghum bicolor gene expression assays and assign them to maize family-based orthologous groups. Using three popular graph-based machine learning frameworks (a shallow graph convolutional autoencoder, a deep graph convolutional autoencoder, and the inductive GraphSage strategy), we encode an Arabidopsis thaliana integrated gene regulatory network (iGRN) structure and TF gene expression values to predict gene expression both within and between species. We then evaluate the network methods against a partial least-squares baseline. We find that the baseline gives the best predictions within species, with Spearman correlations averaging between 0.74 and 0.78. The graph-based methods were more variable, with correlations between -0.1 and 0.65. In particular, the GraphSage and deep autoencoders performed the worst, and the shallow autoencoders performed the best. In the most challenging prediction context, where predictions were made in a new species and on genes not seen during training, the shallow graph autoencoder framework averaged a correlation of around 0.65. Contrary to our initial expectation that preserved network structure would improve gene expression predictions, this study shows that within-species predictions need only simple models, such as partial least squares, to capture expression variation. In cross-species predictions, the best model is often a more complex strategy that uses regulatory network structure and expression data from other studies.
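
The abstract scores every model by the Spearman correlation between predicted and observed expression. As a rough illustration of that metric only, the sketch below ranks both vectors (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the toy expression values are hypothetical and are not taken from the study; the authors' actual evaluation code is not shown in the source.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values for five genes (not data from the study).
observed = [1.2, 3.4, 0.5, 7.8, 2.1]
predicted = [1.0, 2.9, 0.9, 6.5, 3.1]

print(round(spearman(observed, predicted), 3))  # prints 0.9
```

Because only the ordering matters, a model can score well on this metric even when its predicted abundances are on a different scale from the observed ones.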

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. in silico Plants: 38.0% (24 papers in training set, Top 0.1%)
2. PLOS Computational Biology: 7.2% (1633 papers in training set, Top 4%)
3. Frontiers in Genetics: 4.9% (197 papers in training set, Top 1%)
50% of probability mass above
4. Computational and Structural Biotechnology Journal: 3.7% (216 papers in training set, Top 1%)
5. Bioinformatics Advances: 3.3% (184 papers in training set, Top 2%)
6. The Plant Genome: 2.7% (53 papers in training set, Top 0.3%)
7. PLOS ONE: 2.1% (4510 papers in training set, Top 48%)
8. BMC Bioinformatics: 2.1% (383 papers in training set, Top 4%)
9. Scientific Reports: 1.8% (3102 papers in training set, Top 55%)
10. Frontiers in Plant Science: 1.8% (240 papers in training set, Top 3%)
11. Bioinformatics: 1.8% (1061 papers in training set, Top 7%)
12. Nucleic Acids Research: 1.8% (1128 papers in training set, Top 10%)
13. Plant Physiology: 1.7% (217 papers in training set, Top 2%)
14. Briefings in Bioinformatics: 1.3% (326 papers in training set, Top 5%)
15. NAR Genomics and Bioinformatics: 1.3% (214 papers in training set, Top 2%)
16. GigaScience: 1.3% (172 papers in training set, Top 2%)
17. Nature Communications: 1.1% (4913 papers in training set, Top 57%)
18. Communications Biology: 1.0% (886 papers in training set, Top 17%)
19. Plant Communications: 0.9% (35 papers in training set, Top 1%)
20. npj Systems Biology and Applications: 0.9% (99 papers in training set, Top 2%)
21. G3 Genes|Genomes|Genetics: 0.9% (351 papers in training set, Top 2%)
22. Journal of Genetics and Genomics: 0.8% (36 papers in training set, Top 2%)
23. Theoretical and Applied Genetics: 0.8% (46 papers in training set, Top 0.4%)
24. Cell Systems: 0.8% (167 papers in training set, Top 11%)
25. Plant Biotechnology Journal: 0.8% (56 papers in training set, Top 1%)
26. Genome Research: 0.8% (409 papers in training set, Top 4%)
27. Advanced Science: 0.8% (249 papers in training set, Top 19%)
28. Heliyon: 0.7% (146 papers in training set, Top 7%)
29. Synthetic and Systems Biotechnology: 0.7% (10 papers in training set, Top 0.6%)
30. Genome Biology: 0.7% (555 papers in training set, Top 8%)
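
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the five top-ranked journals, using the percentages as displayed (so they are rounded to one decimal place); the function name `journals_to_reach` is hypothetical.

```python
# Predicted probabilities (percent) for the five top-ranked journals,
# copied from the list as displayed (rounded to one decimal place).
ranked = [
    ("in silico Plants", 38.0),
    ("PLOS Computational Biology", 7.2),
    ("Frontiers in Genetics", 4.9),
    ("Computational and Structural Biotechnology Journal", 3.7),
    ("Bioinformatics Advances", 3.3),
]


def journals_to_reach(entries, threshold):
    """Return (count, cumulative %) once the running total reaches the threshold."""
    total = 0.0
    for count, (_, probability) in enumerate(entries, start=1):
        total += probability
        if total >= threshold:
            return count, total
    # Threshold not reached with the entries given.
    return None, total


count, total = journals_to_reach(ranked, 50.0)
print(count, round(total, 1))  # prints 3 50.1
```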