
Hypothesis Generation For Rare and Undiagnosed Diseases Through Clustering and Classifying Time-Versioned Biological Ontologies

Bradshaw, M. S.; Gibbs, C.; Martin, S.; Firman, T.; Gaskell, A.; Fosdick, B.; Layer, R. M.

2023-11-13 bioinformatics
10.1101/2023.11.09.566432 bioRxiv

Rare diseases affect 1 in 10 people in the United States, and despite increased genetic testing, up to half never receive a diagnosis. Even when advanced genome sequencing platforms are used to discover variants, a patient will remain undiagnosed if the literature holds no connection between the variants found in the patient's genome and their phenotypes. When a direct variant-phenotype connection is not known, placing a patient's information in the larger context of phenotype relationships and protein-protein interactions may provide an opportunity to find an indirect explanation. Databases such as STRING contain millions of protein-protein interactions, and HPO contains the relations among thousands of phenotypes. By integrating these networks and clustering the entities within them, we can potentially discover latent gene-to-phenotype connections. The historical records for STRING and HPO provide a unique opportunity to create a network time series for evaluating cluster significance. Most excitingly, working with Children's Hospital Colorado, we provide promising hypotheses about latent gene-to-phenotype connections for 38 patients with undiagnosed diseases. We also provide potential answers for 14 patients listed on MyGene2. Clusters that our tool finds significant harbor 2.35 to 8.72 times as many gene-to-phenotype edges inferred from known drug interactions as clusters it finds insignificant. Our tool, BOCC, is available as a web app and command line tool.
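
The idea of integrating the networks and then looking inside clusters for unreported gene-to-phenotype pairs can be illustrated with a toy sketch. This is not BOCC's implementation: the node names are hypothetical placeholders rather than real STRING or HPO identifiers, and plain connected components stand in for whatever clustering algorithms the tool actually uses. The helper names (`build_graph`, `connected_components`, `candidate_links`) are likewise assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy edges; node names are placeholders, not real STRING or HPO identifiers.
ppi_edges = [("gene:A", "gene:B"), ("gene:C", "gene:D")]          # protein-protein interactions
phenotype_edges = [("pheno:1", "pheno:2")]                        # phenotype-phenotype relations
known_g2p_edges = [("gene:B", "pheno:1"), ("gene:D", "pheno:3")]  # known gene-to-phenotype links


def build_graph(*edge_lists):
    """Merge several edge lists into one undirected adjacency map."""
    graph = defaultdict(set)
    for edges in edge_lists:
        for u, v in edges:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def connected_components(graph):
    """Stand-in for a real clustering step: group nodes by connected component."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def candidate_links(clusters, known):
    """List gene-phenotype pairs that share a cluster but have no known direct edge."""
    known = set(known)
    candidates = []
    for cluster in clusters:
        genes = sorted(n for n in cluster if n.startswith("gene:"))
        phenos = sorted(n for n in cluster if n.startswith("pheno:"))
        candidates += [pair for pair in product(genes, phenos) if pair not in known]
    return candidates


graph = build_graph(ppi_edges, phenotype_edges, known_g2p_edges)
clusters = connected_components(graph)
candidates = candidate_links(clusters, known_g2p_edges)
```

In this toy network, `candidates` holds the co-clustered pairs with no direct edge, which is the kind of latent connection the abstract describes; judging whether a cluster is significant would require the time-versioned evaluation the authors mention, which this sketch does not attempt.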

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 1%, 18.7%
2. BioData Mining: 15 papers in training set, Top 0.1%, 14.4%
3. BMC Bioinformatics: 383 papers in training set, Top 0.9%, 10.1%
4. Bioinformatics Advances: 184 papers in training set, Top 0.4%, 6.8%

50% of probability mass above

5. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.5%, 6.4%
6. PLOS Computational Biology: 1633 papers in training set, Top 6%, 6.3%
7. PLOS ONE: 4510 papers in training set, Top 32%, 4.9%
8. BMC Medical Genomics: 36 papers in training set, Top 0.3%, 2.1%
9. Database: 51 papers in training set, Top 0.3%, 2.1%
10. Frontiers in Genetics: 197 papers in training set, Top 5%, 1.7%
11. GigaScience: 172 papers in training set, Top 1%, 1.7%
12. BMC Genomics: 328 papers in training set, Top 3%, 1.5%
13. Genome Research: 409 papers in training set, Top 3%, 1.5%
14. iScience: 1063 papers in training set, Top 27%, 0.9%
15. NAR Genomics and Bioinformatics: 214 papers in training set, Top 3%, 0.8%
16. Scientific Reports: 3102 papers in training set, Top 73%, 0.8%
17. Cell Systems: 167 papers in training set, Top 11%, 0.8%
18. Nucleic Acids Research: 1128 papers in training set, Top 17%, 0.7%
19. F1000Research: 79 papers in training set, Top 5%, 0.7%
20. Genome Biology: 555 papers in training set, Top 8%, 0.7%
21. Genome Medicine: 154 papers in training set, Top 8%, 0.7%
22. Patterns: 70 papers in training set, Top 3%, 0.6%
23. Journal of Computational Biology: 37 papers in training set, Top 0.7%, 0.6%
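
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below copies the first seven values from the list; the function name `journals_to_reach` is an illustrative assumption, and totals are rounded to one decimal place because the page reports percentages at that precision.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-7, copied from the list above.
probabilities = [18.7, 14.4, 10.1, 6.8, 6.4, 6.3, 4.9]


def journals_to_reach(threshold, probs):
    """Return how many top-ranked entries are needed for the running total to reach threshold."""
    for count, total in enumerate(accumulate(probs), start=1):
        # Round to the page's reporting precision to avoid floating-point drift.
        if round(total, 1) >= threshold:
            return count
    return None


needed = journals_to_reach(50.0, probabilities)
```

With these values the running totals are 18.7, 33.1, 43.2, and 50.0, so the threshold is reached at the fourth journal.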