TumorCLIP: Lightweight Vision-Language Fusion for Explainable MRI-Based Brain Tumor Classification

Jia, Y.; Niu, J.; Qie, Z.; Li, Z.; Laine, A. F.; Guo, J.

2026-03-13 radiology and imaging

10.64898/2026.03.11.26348155 medRxiv


Accurate classification of brain tumors from MRI is critical for guiding clinical decision-making; however, existing deep learning models are often hindered by limited interpretability and pronounced sensitivity to hyperparameter selection, which constrain their reliability in medical settings. To address these challenges, we propose TumorCLIP, a lightweight and training-efficient vision-language framework that integrates radiology-informed text prototypes with a DenseNet-based visual encoder, fused via a Tip-Adapter mechanism, to support clinically meaningful semantic reasoning. TumorCLIP does not aim to introduce a new vision-language model architecture. Instead, its contribution lies in the integration of radiology-informed text prototypes tailored to MRI interpretation, a systematic evaluation of backbone stability across diverse visual architectures, and a lightweight, training-efficient CLIP-based fusion framework designed for medical imaging applications. We first conduct a comprehensive unimodal benchmark across eight representative visual backbones (EfficientNet-B0, MobileNetV3-Large, ResNet50, DenseNet121, ViT, DeiT, Swin Transformer, and MambaOut) using a standardized optimizer and learning-rate grid search, revealing performance swings exceeding 60 percentage points depending on hyperparameter choices. DenseNet121 shows the strongest stability-accuracy trade-off within our evaluated optimizer and learning-rate grid (97.6%). Leveraging this foundation, TumorCLIP fuses image features with frozen CLIP-derived text prototypes, achieving concept-level explainability, robust few-shot adaptation, and enhanced classification of minority tumor classes. On the test set, TumorCLIP attains 98.5% accuracy, including a +1.86 percentage point recall increase for Neurocytoma, suggesting that radiology-informed textual priors can improve semantic alignment and help refine diagnostic decision boundaries within the evaluated setting.
Additional evaluation on an independent external dataset shows that TumorCLIP achieves improved cross-dataset performance under the evaluated distribution shift, relative to the unimodal DenseNet121 baseline. These results demonstrate TumorCLIP as a practical, interpretable, and data-efficient alternative to conventional visual classifiers, providing evidence for radiology-aware vision-language alignment in MRI-based brain tumor classification. All results are reported within the evaluated datasets and training protocols.
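
For readers unfamiliar with Tip-Adapter-style fusion, the following is a minimal sketch of the general idea, not the authors' implementation. It combines a zero-shot term (cosine similarity between an image feature and frozen text prototypes) with a cache term built from a few labeled training features. The function name `tip_adapter_logits`, the toy 3-dimensional features, and the values of `alpha`, `beta`, and `scale` are all illustrative assumptions; the abstract does not state the actual feature dimensions or hyperparameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tip_adapter_logits(image_feat, text_protos, cache_keys, cache_labels,
                       alpha=1.0, beta=5.5, scale=100.0):
    """Generic Tip-Adapter-style logits (hypothetical helper, assumed defaults).

    image_feat   : feature vector of the query image
    text_protos  : one frozen text-prototype vector per class
    cache_keys   : features of the few-shot training images
    cache_labels : integer class index for each cache key
    """
    f = l2_normalize(image_feat)

    # Zero-shot term: scaled cosine similarity to each class text prototype.
    zero_shot = [scale * dot(f, l2_normalize(t)) for t in text_protos]

    # Cache term: each stored training feature votes for its own class,
    # weighted by exp(-beta * (1 - cosine similarity)).
    cache = [0.0] * len(text_protos)
    for key, label in zip(cache_keys, cache_labels):
        affinity = dot(f, l2_normalize(key))
        cache[label] += math.exp(-beta * (1.0 - affinity))

    # Fusion: alpha controls how much the cache term adjusts the text prior.
    return [z + alpha * c for z, c in zip(zero_shot, cache)]


# Toy two-class example with made-up 3-dimensional features.
text_protos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cache_keys = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
cache_labels = [0, 1]
logits = tip_adapter_logits([0.8, 0.2, 0.1], text_protos, cache_keys, cache_labels)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
```

Setting `alpha=0.0` reduces the sketch to plain zero-shot classification against the text prototypes, which makes the contribution of the cache term easy to isolate.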

