
Transfer learning improves outcome predictions for ASD from gene expression in blood

Robasky, K.; Kim, R.; Yi, H.; Xu, H.; Bao, B.; Chang, A. W. T.; Courchesne, E.; Lewis, N. E.

2021-06-29 bioinformatics
10.1101/2021.06.26.449864 bioRxiv

Background: Predicting outcomes in human genetic studies is difficult because the number of variables (genes) is often much larger than the number of observations (human subject tissue samples). We investigated ways to improve model performance on the under-constrained problems typical of human genetics, where the number of strongly correlated genes (features) may exceed 10,000 and the number of study participants (observations) may be under 1,000.

Methods: We created train, validate, and test datasets from 240 microarray observations from 127 subjects diagnosed with autism spectrum disorder (ASD) and 113 typically developing (TD) subjects. We trained a neural network model (the naive model) on 10,422 genes using the train dataset, composed of 70 ASD and 65 TD subjects. We restricted the model to one fully connected hidden layer to minimize the number of trainable parameters and included a dropout layer to help prevent overfitting. We experimented with alternative network architectures, tuned the hyperparameters using the validate dataset, and performed a single, final evaluation using the holdout test dataset. Next, we trained a neural network model with the identical architecture and identical genes to predict tissue type in GTEx data. We transferred that learning by replacing the top layer of the GTEx model with a layer to predict ASD outcome, and we retrained the new layer on the ASD dataset, again using the identical 10,422 genes.

Findings: The naive neural network model achieved AUROC = 0.58 for the task of predicting ASD outcomes; transfer learning yielded a statistically significant 7.8% improvement.

Interpretation: We demonstrated that neural network learning could be transferred from models trained on large RNA-Seq gene expression datasets to a model trained on a small microarray gene expression dataset, with clinical utility for mitigating over-training on small sample sizes. Incidentally, we built a highly accurate classifier of tissue type with which to perform the transfer learning.

Funding: This work was supported in part by NIMH R01-MH110558 (E.C., N.E.L.).

Author Summary: Image recognition and natural language processing have enjoyed great success in reusing computational effort and data sources to overcome the problem of over-training a neural network on a limited dataset. Other domains using deep learning, including genomics and clinical applications, have been slower to benefit from transfer learning. Here we demonstrate data preparation and modeling techniques that allow genomics researchers to take advantage of transfer learning in order to increase the utility of limited clinical datasets. We show that the performance of a naive model without pre-training can be improved by 7.8% by transferring learning from a highly performant model trained on GTEx data to solve a similar problem.
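The transfer step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a framework, so plain NumPy is used here, and all data are synthetic. The feature count (50), hidden size (16), number of source classes (4), learning rates, and every function name are hypothetical choices made for illustration. The sketch pre-trains a network with one hidden layer and dropout on a large multi-class "source" task, then keeps the hidden layer fixed, replaces the output layer with a new binary layer, and trains only that layer on a small "target" dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pretrain_source(X, y, n_hidden, n_classes, epochs=200, lr=0.1, p_drop=0.5):
    """Train input -> hidden (ReLU + dropout) -> softmax on the source task.

    Returns only the hidden-layer parameters; the source output layer is
    discarded, mirroring 'replacing the top layer'.
    """
    n, d = X.shape
    W1 = rng.normal(0.0, np.sqrt(2.0 / d), (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, np.sqrt(1.0 / n_hidden), (n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        pre = X @ W1 + b1
        h = relu(pre)
        # Inverted dropout: zero random units, rescale the survivors.
        mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
        hd = h * mask
        dlogits = (softmax(hd @ W2 + b2) - Y) / n  # cross-entropy gradient
        dh = (dlogits @ W2.T) * mask * (pre > 0)
        W2 -= lr * (hd.T @ dlogits)
        b2 -= lr * dlogits.sum(axis=0)
        W1 -= lr * (X.T @ dh)
        b1 -= lr * dh.sum(axis=0)
    return W1, b1

def train_new_head(H, y, epochs=500, lr=0.1):
    """Fit a fresh logistic output layer on frozen hidden features H."""
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(epochs):
        g = (sigmoid(H @ w + b) - y) / len(y)
        w -= lr * (H.T @ g)
        b -= lr * g.sum()
    return w, b

def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Synthetic "source" task: many samples, 4 classes (stand-in for tissue type).
y_src = rng.integers(0, 4, 600)
X_src = rng.normal(size=(600, 50)) + rng.normal(size=(4, 50))[y_src]

# Synthetic "target" task: few samples, binary label (stand-in for ASD vs. TD).
y_tgt = np.arange(60) % 2
X_tgt = rng.normal(size=(60, 50))
X_tgt[:, 0] += y_tgt  # weak signal in one feature

W1, b1 = pretrain_source(X_src, y_src, n_hidden=16, n_classes=4)

# Transfer: reuse the frozen hidden layer, train only the new output layer.
H_train = relu(X_tgt[:40] @ W1 + b1)
w, b = train_new_head(H_train, y_tgt[:40].astype(float))

probs = sigmoid(relu(X_tgt[40:] @ W1 + b1) @ w + b)
auc = auroc(y_tgt[40:], probs)
print(f"held-out AUROC on synthetic target data: {auc:.2f}")
```

Dropout is applied only during pre-training and omitted at prediction time, which is the usual convention. Because the data are random, the printed AUROC says nothing about the study's results; the point is only the order of operations: pre-train, discard the source output layer, and fit a new output layer on the small dataset.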

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics (383 papers in training set; Top 0.8%; 10.2%)
2. BioData Mining (15 papers in training set; Top 0.1%; 10.2%)
3. Bioinformatics (1061 papers in training set; Top 3%; 10.2%)
4. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.3%; 8.5%)
5. Bioinformatics Advances (184 papers in training set; Top 0.4%; 6.4%)
6. PLOS Computational Biology (1633 papers in training set; Top 8%; 4.4%)
7. Scientific Reports (3102 papers in training set; Top 41%; 3.1%)

50% of probability mass above

8. GigaScience (172 papers in training set; Top 0.8%; 2.5%)
9. PLOS ONE (4510 papers in training set; Top 50%; 1.9%)
10. Cell Genomics (162 papers in training set; Top 3%; 1.7%)
11. Frontiers in Genetics (197 papers in training set; Top 5%; 1.7%)
12. Biological Psychiatry (119 papers in training set; Top 2%; 1.7%)
13. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics (22 papers in training set; Top 0.3%; 1.1%)
14. Briefings in Bioinformatics (326 papers in training set; Top 5%; 1.1%)
15. Autism Research (32 papers in training set; Top 0.3%; 1.0%)
16. Trials (25 papers in training set; Top 1%; 1.0%)
17. npj Genomic Medicine (33 papers in training set; Top 0.6%; 1.0%)
18. Journal of Biomedical Informatics (45 papers in training set; Top 1%; 0.9%)
19. Patterns (70 papers in training set; Top 2%; 0.9%)
20. Database (51 papers in training set; Top 0.8%; 0.8%)
21. Nature Communications (4913 papers in training set; Top 61%; 0.8%)
22. F1000Research (79 papers in training set; Top 4%; 0.8%)
23. BMC Medical Genomics (36 papers in training set; Top 1%; 0.8%)
24. Human Genetics and Genomics Advances (70 papers in training set; Top 0.8%; 0.8%)
25. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 44%; 0.8%)
26. Frontiers in Psychiatry (83 papers in training set; Top 3%; 0.7%)
27. Genetics in Medicine (69 papers in training set; Top 1%; 0.7%)
28. NeuroImage (813 papers in training set; Top 6%; 0.7%)
29. Biology Methods and Protocols (53 papers in training set; Top 3%; 0.7%)
30. eBioMedicine (130 papers in training set; Top 5%; 0.7%)
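
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum. The sketch below is illustrative only: it assumes the final percentage in each entry is that journal's predicted probability, and the function name `journals_to_reach` is hypothetical.

```python
# Predicted probabilities (%) of the ten highest-ranked journals,
# copied from the list above in rank order.
journal_probs = [10.2, 10.2, 10.2, 8.5, 6.4, 4.4, 3.1, 2.5, 1.9, 1.7]

def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative mass) at the first rank where the
    running sum meets or exceeds the threshold, or None if it never does."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None

k, mass = journals_to_reach(journal_probs, 50.0)
print(f"{k} journals reach {mass:.1f}% of the probability mass")
```

The first six entries sum to 49.9%, just short of the threshold, so the seventh is the one that crosses 50%.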