
Analysis and Augmentation of Small Datasets with Unsupervised Machine Learning

Dolgikh, S.

2021-04-25 health informatics
10.1101/2021.04.21.21254796 medRxiv

Analysis of small datasets presents a number of essential challenges, not least because characteristic patterns in the data are insufficiently sampled, which makes confident conclusions about the unknown distribution elusive and results in lower statistical confidence and higher error. In this work, a novel approach to the augmentation of small datasets is proposed, based on an ensemble of neural network models of unsupervised generative self-learning. Applying generative learning with an ensemble of individual models made it possible to identify stable clusters of data points in the latent representations of the observable data. Several augmentation techniques based on the identified latent cluster structure were applied to produce new data points and enhance the dataset. The proposed method can be used with small and extremely small datasets to identify characteristic patterns, augment data and, in some cases, improve classification accuracy in scenarios with a strong deficit of labels.
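
The abstract does not specify the network architecture, the clustering step, or the augmentation techniques, so the following is only a minimal sketch of the general idea under stated assumptions. It uses a tiny tanh autoencoder trained by plain gradient descent, k-means in the latent space, a co-association matrix to find points that are grouped together by every ensemble member, and linear interpolation in latent space to produce new points. All function names, parameters, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, seed=0, epochs=400, lr=0.02):
    """Tiny autoencoder (tanh encoder, linear decoder); an assumed stand-in model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, latent_dim))   # encoder weights
    W2 = rng.normal(0.0, 0.1, (latent_dim, d))   # decoder weights
    for _ in range(epochs):
        Z = np.tanh(X @ W1)                      # latent representation
        err = Z @ W2 - X                         # reconstruction error
        grad_W2 = Z.T @ err / n
        grad_W1 = X.T @ ((err @ W2.T) * (1.0 - Z ** 2)) / n
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2

def kmeans(Z, k, seed=0, iters=50):
    """Plain k-means; returns one cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):              # leave empty clusters unchanged
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def co_association(X, n_models=5, k=2):
    """Fraction of ensemble members that put each pair of points in one cluster."""
    co = np.zeros((len(X), len(X)))
    models = []
    for seed in range(n_models):
        W1, W2 = train_autoencoder(X, seed=seed)
        labels = kmeans(np.tanh(X @ W1), k, seed=seed)
        co += labels[:, None] == labels[None, :]
        models.append((W1, W2))
    return co / n_models, models

def stable_groups(co, threshold=1.0):
    """Connected components of pairs co-clustered in >= threshold of the models."""
    group = -np.ones(len(co), dtype=int)
    next_id = 0
    for i in range(len(co)):
        if group[i] >= 0:
            continue
        group[i] = next_id
        stack = [i]
        while stack:
            a = stack.pop()
            for b in np.flatnonzero((co[a] >= threshold) & (group < 0)):
                group[b] = next_id
                stack.append(b)
        next_id += 1
    return group

def augment(X, group, model, n_new=10, seed=0):
    """Interpolate between two latent points of one stable group, then decode."""
    rng = np.random.default_rng(seed)
    W1, W2 = model
    Z = np.tanh(X @ W1)
    usable = [g for g in np.unique(group) if (group == g).sum() >= 2]
    new_points = []
    for _ in range(n_new):
        g = rng.choice(usable)
        i, j = rng.choice(np.flatnonzero(group == g), size=2, replace=False)
        t = rng.uniform()
        new_points.append(((1.0 - t) * Z[i] + t * Z[j]) @ W2)
    return np.array(new_points)

# Synthetic "small dataset": two blobs of 20 points in 4 dimensions (made up).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(1.5, 0.4, (20, 4)), rng.normal(-1.5, 0.4, (20, 4))])
X = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize each feature

co, models = co_association(X)
group = stable_groups(co)
new_points = augment(X, group, models[0])        # points in standardized space
print("stable groups:", len(np.unique(group)), "new points:", new_points.shape)
```

Requiring agreement across all ensemble members (`threshold=1.0`) is the strictest choice; a lower threshold would merge more points into each group at the cost of less stable clusters.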

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1. PLOS ONE | 4510 papers in training set | Top 20% | 9.9%
2. Scientific Reports | 3102 papers in training set | Top 7% | 9.9%
3. Computers in Biology and Medicine | 120 papers in training set | Top 0.4% | 6.2%
4. Chaos, Solitons & Fractals | 32 papers in training set | Top 0.3% | 6.2%
5. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.7% | 3.9%
6. Mathematics | 11 papers in training set | Top 0.1% | 3.9%
7. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.1% | 3.6%
8. Physical Biology | 43 papers in training set | Top 0.5% | 3.5%
9. Expert Systems with Applications | 11 papers in training set | Top 0.1% | 2.5%
10. Informatics in Medicine Unlocked | 21 papers in training set | Top 0.3% | 2.5%
50% of probability mass above
11. Biomedical Signal Processing and Control | 18 papers in training set | Top 0.2% | 2.0%
12. Cognitive Neurodynamics | 15 papers in training set | Top 0.1% | 2.0%
13. Frontiers in Applied Mathematics and Statistics | 10 papers in training set | Top 0.1% | 1.8%
14. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.3% | 1.8%
15. International Journal of Medical Informatics | 25 papers in training set | Top 0.7% | 1.8%
16. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.9% | 1.8%
17. Journal of Medical Internet Research | 85 papers in training set | Top 2% | 1.7%
18. Bioinformatics | 1061 papers in training set | Top 7% | 1.7%
19. Physica A: Statistical Mechanics and its Applications | 10 papers in training set | Top 0.1% | 1.7%
20. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.7%
21. Sensors | 39 papers in training set | Top 1% | 1.6%
22. IEEE Access | 31 papers in training set | Top 0.4% | 1.6%
23. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences | 15 papers in training set | Top 0.5% | 1.3%
24. JMIR Public Health and Surveillance | 45 papers in training set | Top 3% | 1.2%
25. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.5% | 1.2%
26. PLOS Digital Health | 91 papers in training set | Top 2% | 1.2%
27. Heliyon | 146 papers in training set | Top 5% | 0.9%
28. BioMed Research International | 25 papers in training set | Top 3% | 0.9%
29. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 9% | 0.8%
30. Biology Methods and Protocols | 53 papers in training set | Top 3% | 0.7%
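
The 50% cutoff marked in the list can be checked by accumulating the listed probabilities in rank order. The short sketch below does this with the 30 percentages copied from the list above; because the displayed values are rounded to one decimal place, the running totals are approximate. The variable names are illustrative only.

```python
# Predicted probabilities (percent) for ranks 1-30, copied from the list above.
probabilities = [
    9.9, 9.9, 6.2, 6.2, 3.9, 3.9, 3.6, 3.5, 2.5, 2.5,   # ranks 1-10
    2.0, 2.0, 1.8, 1.8, 1.8, 1.8, 1.7, 1.7, 1.7, 1.7,   # ranks 11-20
    1.6, 1.6, 1.3, 1.2, 1.2, 1.2, 0.9, 0.9, 0.8, 0.7,   # ranks 21-30
]

running = 0.0
cutoff_rank = None
cumulative = []
for rank, p in enumerate(probabilities, start=1):
    running += p
    cumulative.append(running)
    # Record the first rank at which the running total reaches 50%.
    if cutoff_rank is None and running >= 50.0:
        cutoff_rank = rank

print(f"50% of probability mass is reached at rank {cutoff_rank}")
print(f"mass covered by the 30 listed journals: {cumulative[-1]:.1f}%")
```

With the rounded values shown, the running total stays just under 50% through rank 9 and passes it at rank 10, which agrees with the position of the cutoff marker.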