Data Matters: The Impact of Data Curation in the Classification of Histopathological Datasets

Brito-Pacheco, D. A.; Giannopoulos, P.; Reyes-Aldasoro, C. C.

2026-04-17 pathology
10.64898/2026.04.16.26351016 medRxiv
This work investigates the impact of outliers on the performance of machine learning and deep learning models, specifically for histopathological images of colorectal cancer stained with Haematoxylin and Eosin. The impact is evaluated through a systematic comparison of one machine learning model (Random Forests) and one deep learning model (ResNet-18). Both models were trained with the popular NCT-CRC-HE-VAL-100K dataset and tested on the CRC-HE-VAL-7K companion set. A curation process was then performed: the divergence of patches in the training set was analysed based on chromatic, textural and topological features, outliers were removed, and training was repeated with the cleaned dataset. The results showed that machine learning models can benefit more from improvements in data quality than deep learning models. Further, the results suggest that deep learning models are more robust to outliers because, through training, the architectures can learn features other than those previously mentioned.
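The curation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which divergence measure or threshold the authors used, so everything here is an assumption made for illustration: a per-feature robust z-score (median and MAD), a cut-off of 3.5, the function names `robust_z` and `curate`, and the toy feature values.

```python
from statistics import median


def robust_z(values):
    """Return |x - median| / (1.4826 * MAD) for each value (0.0 if MAD is zero)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad
    return [abs(v - med) / scale if scale > 0 else 0.0 for v in values]


def curate(features, threshold=3.5):
    """Split patch indices into (kept, removed).

    features: list of equal-length feature vectors, one per patch.
    A patch is removed if any of its features has a robust z-score
    above the threshold (an assumed rule, not the authors' criterion).
    """
    n_features = len(features[0])
    # Score each feature column independently.
    scores = [robust_z([row[j] for row in features]) for j in range(n_features)]
    kept, removed = [], []
    for i in range(len(features)):
        worst = max(scores[j][i] for j in range(n_features))
        (removed if worst > threshold else kept).append(i)
    return kept, removed


# Hypothetical feature vectors: [chromatic feature, textural feature]
patches = [
    [0.80, 0.31], [0.82, 0.29], [0.79, 0.33], [0.81, 0.30],
    [0.78, 0.32], [0.10, 0.95],  # the last patch lies far from the rest
]
kept, removed = curate(patches)
print("kept:", kept, "removed:", removed)
```

In a pipeline like the one the abstract describes, the model would then be retrained on the patches in `kept` and evaluated again on the unchanged test set.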

Matching journals

The top 4 journals are the smallest set that covers at least 50% of the predicted probability mass (59.3% combined).

Rank | Journal | Papers in training set | Percentile | Probability
1 | Journal of Pathology Informatics | 13 | Top 0.1% | 22.2%
2 | PLOS ONE | 4510 | Top 11% | 17.3%
3 | Computers in Biology and Medicine | 120 | Top 0.1% | 9.9%
4 | Scientific Reports | 3102 | Top 7% | 9.9%
50% of probability mass above
5 | Cureus | 67 | Top 0.6% | 6.3%
6 | Cancers | 200 | Top 1% | 4.8%
7 | Biology Methods and Protocols | 53 | Top 0.1% | 4.8%
8 | GigaScience | 172 | Top 2% | 1.3%
9 | Journal of Medical Internet Research | 85 | Top 3% | 1.3%
10 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.2%
11 | Journal of Medical Imaging | 11 | Top 0.2% | 1.2%
12 | Medical Image Analysis | 33 | Top 0.8% | 0.9%
13 | Heliyon | 146 | Top 5% | 0.9%
14 | IEEE Access | 31 | Top 0.9% | 0.8%
15 | Chaos: An Interdisciplinary Journal of Nonlinear Science | 16 | Top 0.3% | 0.7%
16 | Informatics in Medicine Unlocked | 21 | Top 1% | 0.7%
17 | Animals | 20 | Top 1% | 0.6%
18 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 3% | 0.6%
19 | PLOS Digital Health | 91 | Top 3% | 0.6%
20 | Computational and Structural Biotechnology Journal | 216 | Top 11% | 0.6%
21 | Diagnostics | 48 | Top 3% | 0.6%
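
The 50% cut-off in the ranking can be checked with a short cumulative sum. This is a sketch only: the probabilities are copied from the list above in rank order, and the helper name `journals_to_reach` is hypothetical. The first three journals sum to 49.4%, so the fourth is the one that crosses the threshold.

```python
# Predicted probabilities (percent) copied from the ranked list, in rank order.
probs = [22.2, 17.3, 9.9, 9.9, 6.3, 4.8, 4.8, 1.3, 1.3, 1.2, 1.2,
         0.9, 0.9, 0.8, 0.7, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6]


def journals_to_reach(probs, target=50.0):
    """Return (count, cumulative) for the shortest ranked prefix reaching target."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return count, total
    # Target never reached: report the full sum instead.
    return None, total


count, cumulative = journals_to_reach(probs)
print(f"{count} journals cover {cumulative:.1f}% of the probability mass")
```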