Integrative approaches to improve the informativeness of deep learning models for human complex diseases

Dey, K. K.; Kim, S. S.; Gazal, S.; Nasser, J.; Engreitz, J. M.; Price, A.

2020-09-09 genetics
10.1101/2020.09.08.288563 bioRxiv

Deep learning models have achieved great success in predicting genome-wide regulatory effects from DNA sequence, but recent work has reported that SNP annotations derived from these predictions contribute limited unique information for human complex disease. Here, we explore three integrative approaches to improve the disease informativeness of allelic-effect annotations (predicted difference between reference and variant alleles) constructed using several previously trained deep learning models: DeepSEA, Basenji and DeepBind (and a related machine learning model, deltaSVM). First, we employ gradient boosting to learn optimal combinations of deep learning annotations, using fine-mapped SNPs and matched control SNPs (on held-out chromosomes) for training. Second, we improve the specificity of these annotations by restricting them to SNPs implicated by (proximal and distal) SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies, generalizing the previously proposed ExPecto approach, which incorporates deep learning annotations based on distance to TSS. We evaluated these approaches using stratified LD score regression, using functional data in blood and focusing on 11 autoimmune diseases and blood-related traits (average N = 306K). We determined that the three approaches produced SNP annotations that were uniquely informative for these diseases/traits, even though linear combinations of the underlying DeepSEA, Basenji, DeepBind and deltaSVM blood annotations were not uniquely informative for these diseases/traits. Our results highlight the benefits of integrating SNP annotations produced by deep learning models with other types of data, including data linking SNPs to genes.
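
To make two of the terms in the abstract concrete, here is a minimal sketch of an allelic-effect annotation and of restricting it to S2G-linked SNPs. It is illustrative only and not the authors' pipeline: the function names, the use of an absolute difference, the zeroing of unlinked SNPs, and all numeric values are assumptions made for this example.

```python
# Illustrative sketch only; names and values are hypothetical, not from the paper.

def allelic_effect(ref_pred, alt_pred):
    """Per-SNP difference between predictions for the two alleles.

    Assumption: the absolute difference is used; the abstract only says
    "predicted difference between reference and variant alleles".
    """
    return [abs(alt - ref) for ref, alt in zip(ref_pred, alt_pred)]


def restrict_to_s2g(annotation, s2g_linked):
    """Keep annotation values only at SNPs linked to a gene by an S2G strategy.

    Assumption: unlinked SNPs are set to 0.0.
    """
    return [value if linked else 0.0
            for value, linked in zip(annotation, s2g_linked)]


# Made-up model outputs for four SNPs (reference vs. variant allele).
ref_pred = [0.10, 0.50, 0.30, 0.80]
alt_pred = [0.40, 0.50, 0.10, 0.70]
# Made-up S2G membership: True if the SNP is linked to a gene.
s2g_linked = [True, False, True, False]

effect = allelic_effect(ref_pred, alt_pred)
restricted = restrict_to_s2g(effect, s2g_linked)
```

In this toy input, only the first and third SNPs keep a non-zero value after restriction, which is the sense in which S2G linking is meant to improve the specificity of an annotation.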

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1 | The American Journal of Human Genetics | 206 papers in training set | Top 0.3% | 14.3%
2 | Human Genetics and Genomics Advances | 70 papers in training set | Top 0.1% | 12.0%
3 | Frontiers in Genetics | 197 papers in training set | Top 0.2% | 12.0%
4 | Bioinformatics | 1061 papers in training set | Top 3% | 8.2%
5 | Nature Communications | 4913 papers in training set | Top 27% | 6.6%
50% of probability mass above
6 | Nucleic Acids Research | 1128 papers in training set | Top 5% | 4.2%
7 | Cell Genomics | 162 papers in training set | Top 1% | 3.9%
8 | Nature Genetics | 240 papers in training set | Top 2% | 3.5%
9 | PLOS Computational Biology | 1633 papers in training set | Top 11% | 3.0%
10 | Scientific Reports | 3102 papers in training set | Top 46% | 2.5%
11 | Genome Research | 409 papers in training set | Top 2% | 2.0%
12 | Genome Medicine | 154 papers in training set | Top 4% | 2.0%
13 | Genetic Epidemiology | 46 papers in training set | Top 0.4% | 1.8%
14 | Communications Biology | 886 papers in training set | Top 8% | 1.7%
15 | eLife | 5422 papers in training set | Top 40% | 1.7%
16 | Genome Biology | 555 papers in training set | Top 5% | 1.6%
17 | Bioinformatics Advances | 184 papers in training set | Top 3% | 1.4%
18 | BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
19 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 37% | 1.3%
20 | Human Molecular Genetics | 130 papers in training set | Top 2% | 1.3%
21 | Genetics | 225 papers in training set | Top 3% | 1.2%
22 | Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.2%
23 | BMC Genomics | 328 papers in training set | Top 4% | 0.9%
24 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
25 | PLOS Genetics | 756 papers in training set | Top 14% | 0.8%
26 | Frontiers in Immunology | 586 papers in training set | Top 8% | 0.7%
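
The "50% of probability mass" marker in the list can be checked with a short cumulative sum. The sketch below assumes the cutoff is the first rank at which the running total of the listed (rounded) percentages reaches 50%; the page does not state how the marker is computed, and the function name is hypothetical.

```python
# Illustrative sketch; the cutoff rule is an assumption, not documented by the page.

def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank where the running
    total of ranked probabilities (in percent) reaches the threshold,
    or None if the threshold is never reached."""
    total = 0.0
    for rank, prob in enumerate(probabilities, start=1):
        total += prob
        if total >= threshold:
            return rank, total
    return None


# Percentages for the six highest-ranked journals, as listed (rounded).
top_probs = [14.3, 12.0, 12.0, 8.2, 6.6, 4.2]
rank, mass = journals_to_reach(top_probs)
```

With the rounded values shown, the running total first passes 50% at rank 5, which agrees with where the marker sits in the list.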