
Non-parametric Bayesian density estimation for biological sequence space with applications to pre-mRNA splicing and the karyotypic diversity of human cancer

Chen, W.-C.; Zhou, J.; Sheltzer, J. M.; Kinney, J. B.; McCandlish, D. M.

2020-11-27 bioinformatics
10.1101/2020.11.25.399253 bioRxiv

Density estimation in sequence space is a fundamental problem in machine learning that is of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy, i.e. calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates. Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data is plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data is sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand the accumulation of chromosomal abnormalities during cancer progression.
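
The two endpoints named in the abstract can be illustrated with a toy sketch. It assumes the simplest case, where the maximum entropy estimate is constrained only by single-site frequencies, so that it reduces to a product of per-site marginals. The linear mixture in `interpolate` is a hypothetical stand-in used only to show movement between the two endpoints; it is not the paper's maximum a posteriori estimator, and the weight `w` is not the paper's hyperparameter. All function names and the sample data are illustrative assumptions.

```python
from collections import Counter
from itertools import product


def empirical_frequencies(seqs, alphabet):
    """Observed frequency of every possible sequence of the sample's length."""
    length = len(seqs[0])
    counts = Counter(seqs)
    n = len(seqs)
    return {"".join(s): counts["".join(s)] / n
            for s in product(alphabet, repeat=length)}


def independent_site_maxent(seqs, alphabet):
    """Maximum entropy distribution matching single-site frequencies only.

    Assumption: with first-order constraints alone, the maximum entropy
    solution is the product of the per-site marginal frequencies.
    """
    length = len(seqs[0])
    n = len(seqs)
    marginals = [Counter(s[i] for s in seqs) for i in range(length)]
    dist = {}
    for s in product(alphabet, repeat=length):
        p = 1.0
        for i, c in enumerate(s):
            p *= marginals[i][c] / n
        dist["".join(s)] = p
    return dist


def interpolate(maxent, empirical, w):
    """Toy linear mixture (NOT the paper's MAP family).

    w = 0 returns the maximum entropy estimate, w = 1 the sample frequencies.
    """
    return {s: (1 - w) * maxent[s] + w * empirical[s] for s in maxent}


# Hypothetical two-letter alphabet and four observed sequences of length 2.
sample = ["AA", "AA", "CC", "AC"]
alphabet = "AC"

emp = empirical_frequencies(sample, alphabet)
me = independent_site_maxent(sample, alphabet)
mid = interpolate(me, emp, 0.5)

for seq in sorted(me):
    print(seq, me[seq], mid[seq], emp[seq])
```

In this toy sample the sequence `CA` is never observed, so its sample frequency is zero, while the independent-site estimate still assigns it nonzero probability. This is the conservative behavior in sparse regions that the abstract attributes to maximum entropy.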

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics (1061 papers in training set; Top 1.0%; 23.2%)
2. PLOS Computational Biology (1633 papers in training set; Top 2%; 12.7%)
3. Genetics (225 papers in training set; Top 0.5%; 8.5%)
4. BMC Bioinformatics (383 papers in training set; Top 2%; 4.7%)
5. The Annals of Applied Statistics (15 papers in training set; Top 0.1%; 4.1%)

50% of probability mass above

6. Cell Systems (167 papers in training set; Top 3%; 3.7%)
7. Frontiers in Genetics (197 papers in training set; Top 4%; 1.8%)
8. GENETICS (189 papers in training set; Top 0.6%; 1.7%)
9. NAR Genomics and Bioinformatics (214 papers in training set; Top 2%; 1.7%)
10. Nucleic Acids Research (1128 papers in training set; Top 10%; 1.7%)
11. Journal of Computational Biology (37 papers in training set; Top 0.2%; 1.7%)
12. Physical Review E (95 papers in training set; Top 0.7%; 1.5%)
13. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 34%; 1.5%)
14. Scientific Reports (3102 papers in training set; Top 61%; 1.5%)
15. Biometrics (22 papers in training set; Top 0.1%; 1.5%)
16. Biostatistics (21 papers in training set; Top 0.1%; 1.4%)
17. Biophysical Journal (545 papers in training set; Top 3%; 1.4%)
18. Frontiers in Molecular Biosciences (100 papers in training set; Top 2%; 1.4%)
19. Molecular Biology and Evolution (488 papers in training set; Top 3%; 1.3%)
20. Physical Biology (43 papers in training set; Top 1%; 1.3%)
21. Communications Biology (886 papers in training set; Top 15%; 1.1%)
22. Bioinformatics Advances (184 papers in training set; Top 4%; 1.0%)
23. Nature Biotechnology (147 papers in training set; Top 6%; 1.0%)
24. Nature Communications (4913 papers in training set; Top 59%; 0.9%)
25. PLOS ONE (4510 papers in training set; Top 65%; 0.8%)
26. Genome Research (409 papers in training set; Top 4%; 0.8%)
27. BioData Mining (15 papers in training set; Top 0.8%; 0.8%)
28. Journal of Bioinformatics and Systems Biology (14 papers in training set; Top 0.6%; 0.8%)
29. eLife (5422 papers in training set; Top 58%; 0.7%)
30. Briefings in Bioinformatics (326 papers in training set; Top 7%; 0.7%)