Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease

Bandoy, D. D. R.; Weimer, B. C.

2019-08-20 genomics
10.1101/739540 bioRxiv
High-dimensional data generated from bacterial whole genome sequencing provides an unprecedented scale of information that requires appropriate statistical frameworks to infer biological function from bacterial genomic populations. Applying genome-wide association study (GWAS) methods to bacterial population genomics is an emerging approach that yields a list of genes associated with a phenotype, but the relative importance of the candidates in the list is undefined. Here, we validate the combination of GWAS, machine learning, and pathogenic bacterial population genomics as a novel scheme to identify SNPs and rank allelic variants to determine associations for accurate estimation of disease phenotype. This approach parsed a dataset of 1.2 million SNPs and produced a ranked importance of associated alleles of Campylobacter jejuni porA across multiple spatial locations over a 30-year period. We validated this approach using alleles previously confirmed experimentally in an in vivo guinea pig abortion model. This approach, termed BioML, defined intestinal and extraintestinal groups with differential allelic variants that cause abortion. Divergent variants containing indels that defeated gene callers were rescued using biological context and knowledge, which defined rare and divergent variants that were maintained in the population across two continents and 30 years. This study demonstrates the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles and define their role in abortion and, more broadly, infectious disease.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Cell Genomics | 162 | Top 0.1% | 36.8% |
| 2 | Proceedings of the National Academy of Sciences | 2130 | Top 16% | 4.2% |
| 3 | Genome Medicine | 154 | Top 2% | 4.2% |
| 4 | PLOS Genetics | 756 | Top 5% | 3.5% |
| 5 | Genome Biology | 555 | Top 3% | 3.5% |

50% of probability mass above

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 6 | Cell Reports Methods | 141 | Top 1% | 3.0% |
| 7 | Cell Systems | 167 | Top 5% | 2.7% |
| 8 | eLife | 5422 | Top 32% | 2.7% |
| 9 | Nature Communications | 4913 | Top 45% | 2.5% |
| 10 | PLOS Computational Biology | 1633 | Top 12% | 2.5% |
| 11 | Cell | 370 | Top 9% | 2.3% |
| 12 | mSystems | 361 | Top 4% | 2.0% |
| 13 | Frontiers in Genetics | 197 | Top 4% | 1.8% |
| 14 | Nucleic Acids Research | 1128 | Top 10% | 1.8% |
| 15 | PLOS Pathogens | 721 | Top 6% | 1.6% |
| 16 | Microbial Genomics | 204 | Top 1% | 1.6% |
| 17 | Nature Methods | 336 | Top 4% | 1.6% |
| 18 | Cell Reports | 1338 | Top 25% | 1.6% |
| 19 | Science Advances | 1098 | Top 21% | 1.4% |
| 20 | Nature Genetics | 240 | Top 6% | 1.2% |
| 21 | BMC Genomics | 328 | Top 4% | 1.2% |
| 22 | Frontiers in Immunology | 586 | Top 7% | 0.9% |
| 23 | iScience | 1063 | Top 28% | 0.9% |
| 24 | Science | 429 | Top 19% | 0.8% |
| 25 | Nature Biotechnology | 147 | Top 7% | 0.8% |
| 26 | Scientific Reports | 3102 | Top 76% | 0.7% |
| 27 | Nature | 575 | Top 16% | 0.7% |
| 28 | Nature Machine Intelligence | 61 | Top 4% | 0.7% |
| 29 | Microbiome | 139 | Top 3% | 0.7% |
| 30 | NAR Genomics and Bioinformatics | 214 | Top 4% | 0.7% |