
How to gain valuable insight from scarce data with Machine Learning: a post-hoc explanation tool to identify biases in biological images classification

Bolut, C.; Pacary, A.; Pieruccioni, L.; Ousset, M.; Paupert, J.; Casteilla, L.; Simoncini, D.

bioRxiv preprint (bioinformatics), posted 2026-02-20. DOI: 10.64898/2026.02.20.706981

Machine learning (ML) models are effective at classifying images across various fields, including biology. However, their performance on biomedical images is often limited by the small size of available datasets, which are constrained by the time-consuming and costly nature of experimental data collection. A review of the literature shows that many studies using biomedical images fail to follow ML best practices. This study focuses on regenerative medicine, which aims to promote tissue regeneration rather than scarring. To explore this process, we applied ML to a limited dataset of images of mouse tissue, aiming to distinguish between regenerating and scarring samples. As expected, binary classification failed to generalize to independent data. A novel SHAP-based analysis revealed that the overfitted models relied on spurious correlations, including characteristics of individual mice that happened to align with the regeneration/scarring labels. The models appeared to be solving the binary classification task but were in fact recognizing individuals. To investigate this behavior further, we examined the test-set confusion matrix of a model trained to identify individual mice. We observed that, beyond individual recognition, individuals were grouped according to the time elapsed after injury (day 3 or 10) and the healing outcome (regeneration or scarring). We hypothesized that these groupings were based on relevant biological information captured by the model. To test this hypothesis, we successfully trained a model to classify images by the time elapsed after injury (3 or 10 days), demonstrating that ML can extract relevant biological information when the task is aligned with what the data can actually support. Altogether, this study demonstrates that carefully examining a model's explanations is an effective way not only to unveil putative biases but also to extract relevant information from a limited dataset.
Author summary

Machine learning is increasingly used to analyze biomedical images, but in many experimental settings only small datasets are available, which can easily mislead powerful models. In this study, we looked at images of mouse tissue, with the goal of distinguishing healing by regeneration from healing by scarring. Although standard machine learning models appeared to perform well during training, they failed to generalize to new animals. By carefully analyzing model explanations, we found that the models were not learning biologically meaningful patterns of tissue repair but were instead recognizing individual mice from subtle image-specific signatures. Importantly, the same analysis revealed that the models did capture relevant biological information when the task was better aligned with the data, such as distinguishing early from late stages of healing. Our results highlight how explanation methods can uncover hidden biases, prevent false conclusions, and help researchers extract meaningful biological insights even from limited and imperfect datasets.
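The abstract's central tool is post-hoc Shapley-value (SHAP) attribution. The toy sketch below is a minimal, hedged illustration of that idea only: the synthetic features, the hand-rolled logistic model, and the exact Shapley computation are all assumptions for demonstration, not the authors' image pipeline. It shows how attributions can flag a spurious "individual identity" feature that shortcuts the intended classification task, mirroring the bias the study uncovered.

```python
# Minimal, self-contained sketch (toy data, hand-rolled logistic model and
# exact Shapley values -- NOT the authors' pipeline) of how Shapley-based
# attributions can expose a spurious feature that shortcuts a classifier.
import itertools
from math import factorial

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset standing in for extracted image features:
#   feature 0 -- weak genuine biological signal
#   feature 1 -- "mouse identity" proxy, almost perfectly aligned with labels
#   feature 2 -- pure noise
n = 200
label = rng.integers(0, 2, n)
X = np.column_stack([
    label + rng.normal(0, 2.0, n),
    label + rng.normal(0, 0.1, n),
    rng.normal(0, 1.0, n),
])

# Fit a logistic model by plain gradient descent.
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - label
    w -= 0.1 * (X.T @ grad) / n
    b -= 0.1 * grad.mean()

def f(x):
    """Model output (probability of class 1) for one sample."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def shapley(x, baseline):
    """Exact Shapley values for a 3-feature model: average marginal
    contribution of each feature over all feature orderings, replacing
    'absent' features by the baseline (dataset mean)."""
    d = len(x)
    phi = np.zeros(d)
    for order in itertools.permutations(range(d)):
        z = baseline.copy()
        prev = f(z)
        for j in order:
            z[j] = x[j]
            cur = f(z)
            phi[j] += cur - prev
            prev = cur
    return phi / factorial(d)

baseline = X.mean(axis=0)
# Global importance: mean |Shapley value| per feature over the dataset.
importance = np.mean(
    np.abs([shapley(X[i], baseline) for i in range(n)]), axis=0
)
print("mean |phi| per feature:", importance)
# The spurious "identity" feature (index 1) dominates the attribution,
# which is how an explanation-based audit can reveal that a model is
# recognizing individuals rather than the biology of interest.
```

On real images this exact enumeration is infeasible (2^d coalitions); practical analyses use approximations such as those in the shap library, but the diagnostic logic is the same: if attribution mass concentrates on identity-linked cues rather than task-relevant structure, the model has learned a shortcut.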

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Biology Methods and Protocols                       53                      Top 0.1%    15.1%
2     PLOS Computational Biology                          1633                    Top 3%      10.7%
3     PLOS ONE                                            4510                    Top 24%     7.0%
4     Scientific Reports                                  3102                    Top 16%     6.6%
5     Frontiers in Artificial Intelligence                18                      Top 0.1%    3.8%
6     Frontiers in Bioinformatics                         45                      Top 0.1%    3.8%
7     GigaScience                                         172                     Top 0.4%    3.7%
----- 50% of probability mass above this line -----
8     Bioinformatics                                      1061                    Top 6%      3.0%
9     PLOS Digital Health                                 91                      Top 1%      1.9%
10    iScience                                            1063                    Top 12%     1.8%
11    IEEE Access                                         31                      Top 0.3%    1.7%
12    Patterns                                            70                      Top 0.8%    1.7%
13    BMC Medical Informatics and Decision Making         39                      Top 1%      1.7%
14    Bioengineering                                      24                      Top 0.5%    1.5%
15    Artificial Intelligence in Medicine                 15                      Top 0.4%    1.4%
16    Cancers                                             200                     Top 3%      1.4%
17    Frontiers in Physiology                             93                      Top 4%      1.3%
18    Computational and Structural Biotechnology Journal  216                     Top 6%      1.3%
19    Journal of Pathology Informatics                    13                      Top 0.2%    1.3%
20    Bioinformatics Advances                             184                     Top 4%      1.1%
21    Computers in Biology and Medicine                   120                     Top 3%      1.1%
22    BMC Bioinformatics                                  383                     Top 6%      0.9%
23    Frontiers in Genetics                               197                     Top 8%      0.9%
24    BioData Mining                                      15                      Top 0.7%    0.8%
25    PeerJ                                               261                     Top 13%     0.8%
26    Computer Methods and Programs in Biomedicine        27                      Top 0.8%    0.8%
27    BMC Medical Research Methodology                    43                      Top 1%      0.8%
28    Journal of Medical Imaging                          11                      Top 0.3%    0.8%
29    Ecological Informatics                              29                      Top 0.7%    0.8%
30    eLife                                               5422                    Top 56%     0.8%