Vision Transformers Based AI Models For Predicting Colorectal Cancer from Digital Pathology WSI: Use Case Of MHIST dataset
Kondejkar, T.; Tunik, G.; Amal, S.
This study investigates the efficacy of transformer-based deep learning architectures, specifically the Vision Transformer (ViT), Class Attention in Image Transformers (CaiT), and Data-Efficient Image Transformers (DeiT), for the binary classification of colorectal polyps using the Minimalist Histopathology Image Analysis Dataset (MHIST). The dataset comprises 3,152 hematoxylin and eosin (H&E)-stained, formalin-fixed, paraffin-embedded (FFPE) images annotated as either Hyperplastic Polyps (HP) or Sessile Serrated Adenomas (SSA). A rigorous evaluation was conducted using 5-fold stratified cross-validation, and performance was quantified with accuracy, precision, recall, F1-score, and AUC-ROC. Experimental results revealed that transformer architectures, particularly CaiT (accuracy of 90.18%, AUC-ROC of 95.52%), outperformed traditional convolutional neural networks (CNNs). The superior performance of CaiT is attributed to its specialized class-attention mechanism, which effectively captures the nuanced morphological differences essential for accurate histopathological classification. These findings underscore the potential of transformer-based models to enhance diagnostic precision, reduce variability in pathological assessment, and facilitate earlier and more reliable colorectal cancer screening.
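The 5-fold stratified cross-validation described above can be sketched in plain Python: each fold preserves the HP/SSA class proportions of the full dataset, so every test fold is representative of the imbalanced binary labels. This is an illustrative sketch, not the authors' pipeline; the class counts below are hypothetical placeholders (only the dataset total of 3,152 images comes from the abstract).

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions
    preserved in every fold, as in stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Round-robin each class's shuffled indices across the k folds
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# Hypothetical HP/SSA split summing to the 3,152 MHIST images
labels = [0] * 2162 + [1] * 990  # 0 = HP, 1 = SSA (counts are illustrative)
for train, test in stratified_kfold(labels, k=5):
    ssa = sum(labels[i] for i in test)
    print(f"test fold: {len(test)} images ({len(test) - ssa} HP, {ssa} SSA)")
```

In practice one would train the ViT/CaiT/DeiT model on each `train` split and accumulate accuracy, precision, recall, F1, and AUC-ROC on each `test` split, reporting the mean across the five folds.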