
Revisiting Reconstruction Likelihood: Variational Autoencoders for Biological and Biomedical Data Clustering

Korenic, A.; Özkaya, U.; Capar, A.

2026-04-12 bioinformatics
10.64898/2026.04.09.717460 bioRxiv

Background and Objective: Variational Autoencoders (VAEs) offer a powerful framework for unsupervised anomaly detection and data clustering, often surpassing traditional methods. A core strength of VAEs lies in their ability to model data distributions probabilistically, enabling robust identification of anomalies and clusters through reconstruction likelihood, a stochastic metric that provides a principled alternative to deterministic error scores.

Methods: We investigated how different VAE architectures, combining reconstruction likelihood with a learnable or data-driven prior, performed on a clustering task using a toy dataset (MNIST). Results were verified using dimensionality reduction techniques, t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), alongside clustering algorithms, k-means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).

Results: The VAE encoder inherently maps data points into a latent space exhibiting discernible cluster structure, as evidenced by alignment with ground-truth labels. While dimensionality reduction techniques (both t-SNE and UMAP) facilitated the application of the clustering algorithms (k-means and HDBSCAN), these methods were primarily used to visualize and interpret the organization of the latent space.

Conclusions: This study demonstrates that VAEs effectively cluster data by implicitly encoding assignments in their latent representations. Determining cluster membership from the encoder output, combined with reconstruction likelihood over semantic features, offers a principled approach for identifying typical samples and anomalies. Future research should focus on leveraging this inherent clustering capability of VAEs to enhance interpretability and facilitate clinical application.
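The reconstruction-likelihood metric the abstract refers to can be sketched as a Monte Carlo estimate of E_{z~q(z|x)}[log p(x|z)]. The snippet below is a minimal illustration only, not the paper's implementation: the toy dimensions and the randomly initialised linear "encoder"/"decoder" stand in for a trained VAE with a Bernoulli decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper works at MNIST scale)
D, Z = 20, 2  # data dimension, latent dimension

# Random weights stand in for a trained VAE's encoder and decoder
W_enc = rng.normal(scale=0.1, size=(D, 2 * Z))  # outputs [mu, logvar]
W_dec = rng.normal(scale=0.1, size=(Z, D))

def encode(x):
    """Linear 'encoder': returns mu and logvar of q(z|x)."""
    h = x @ W_enc
    return h[:Z], h[Z:]

def decode(z):
    """Linear 'decoder' with sigmoid: Bernoulli pixel probabilities."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

def reconstruction_loglik(x, n_samples=64):
    """Monte Carlo estimate of E_{z~q(z|x)}[log p(x|z)], the stochastic
    reconstruction likelihood used as an anomaly/typicality score."""
    mu, logvar = encode(x)
    std = np.exp(0.5 * logvar)
    total = 0.0
    for _ in range(n_samples):
        z = mu + std * rng.normal(size=Z)  # reparameterised latent sample
        p = decode(z)
        # Bernoulli log-likelihood of x under the decoder distribution
        total += np.sum(x * np.log(p + 1e-9) + (1 - x) * np.log(1 - p + 1e-9))
    return total / n_samples

x = (rng.random(D) > 0.5).astype(float)  # a binarised toy sample
score = reconstruction_loglik(x)
print(f"reconstruction log-likelihood: {score:.2f}")
```

Because the score averages over latent samples rather than using a single deterministic reconstruction, it reflects the decoder's full predictive distribution; low values flag candidate anomalies.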

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | Bioinformatics | 1061 | Top 3% | 10.1% |
| 2 | Scientific Reports | 3102 | Top 10% | 8.4% |
| 3 | PLOS ONE | 4510 | Top 28% | 6.4% |
| 4 | PLOS Computational Biology | 1633 | Top 6% | 6.3% |
| 5 | IEEE Access | 31 | Top 0.1% | 4.9% |
| 6 | BMC Medical Informatics and Decision Making | 39 | Top 0.7% | 4.0% |
| 7 | Journal of Biomedical Informatics | 45 | Top 0.4% | 3.6% |
| 8 | Computers in Biology and Medicine | 120 | Top 0.9% | 3.6% |
| 9 | BMC Bioinformatics | 383 | Top 3% | 3.6% |
| 10 | Bioinformatics Advances | 184 | Top 2% | 3.3% |
| 11 | Artificial Intelligence in Medicine | 15 | Top 0.2% | 2.6% |
| 12 | Frontiers in Artificial Intelligence | 18 | Top 0.2% | 2.4% |
| 13 | GigaScience | 172 | Top 0.9% | 2.1% |
| 14 | Biology Methods and Protocols | 53 | Top 0.7% | 1.9% |
| 15 | NeuroImage | 813 | Top 4% | 1.7% |
| 16 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 1% | 1.7% |
| 17 | BioData Mining | 15 | Top 0.3% | 1.7% |
| 18 | Computational and Structural Biotechnology Journal | 216 | Top 4% | 1.7% |
| 19 | Patterns | 70 | Top 1.0% | 1.7% |
| 20 | Frontiers in Genetics | 197 | Top 6% | 1.5% |
| 21 | Acta Psychiatrica Scandinavica | 10 | Top 0.3% | 0.9% |
| 22 | Communications Biology | 886 | Top 19% | 0.9% |
| 23 | Computer Methods and Programs in Biomedicine | 27 | Top 1.0% | 0.7% |
| 24 | Journal of Medical Internet Research | 85 | Top 5% | 0.7% |
| 25 | PLOS Digital Health | 91 | Top 3% | 0.7% |
| 26 | Nature Communications | 4913 | Top 64% | 0.7% |
| 27 | Journal of the American Medical Informatics Association | 61 | Top 2% | 0.7% |
| 28 | Life | 27 | Top 0.5% | 0.7% |
| 29 | Frontiers in Physiology | 93 | Top 6% | 0.7% |
| 30 | npj Digital Medicine | 97 | Top 4% | 0.6% |
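The "50% of probability mass" cutoff noted above can be reproduced directly from the listed probabilities by finding the smallest rank at which the cumulative probability reaches 50%:

```python
from itertools import accumulate

# Predicted probabilities (%) for the 30 ranked journals, as listed above
probs = [10.1, 8.4, 6.4, 6.3, 4.9, 4.0, 3.6, 3.6, 3.6, 3.3,
         2.6, 2.4, 2.1, 1.9, 1.7, 1.7, 1.7, 1.7, 1.7, 1.5,
         0.9, 0.9, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.6]

# Smallest rank whose running total reaches 50% of the probability mass
cutoff = next(rank for rank, cum in enumerate(accumulate(probs), start=1)
              if cum >= 50.0)
print(cutoff)  # 9
```

The running total first crosses 50% (at roughly 50.9%) at rank 9, which is why the cutoff falls between BMC Bioinformatics and Bioinformatics Advances.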