Vision-Language Foundation Models Do Not Transfer to Medical Imaging Classification: A Negative Result on Chest X-ray Diagnosis

Fisher, G. R.

2025-12-08 · radiology and imaging
medRxiv · DOI: 10.64898/2025.12.06.25341759
Abstract

Vision-language models (VLMs) pretrained on web-scale data have achieved remarkable performance across diverse tasks, leading to widespread adoption in industry. A natural question is whether these powerful representations transfer to specialized medical imaging domains, and whether domain-specific medical pretraining improves transfer. We tested these hypotheses using two VLMs on the NIH ChestX-ray14 benchmark: Qwen2.5-VL (pretrained on web data) and BiomedCLIP (pretrained on 15 million PubMed biomedical image-text pairs). Both models dramatically underperformed convolutional neural networks (CNNs) with ImageNet pretraining. Across 5 random seeds, the best VLM achieved F1 = 0.196 ± 0.004 versus a CNN baseline of F1 = 0.811. Domain-specific pretraining provided marginal improvement: BiomedCLIP's frozen encoder achieved F1 = 0.161 ± 0.001 versus Qwen's F1 = 0.124 (+30%), but this remains clinically inadequate. Fine-tuning both models led to catastrophic overfitting, with sensitivity collapsing from >65% to <36% as the models learned to predict "no disease" for all inputs. These results demonstrate that neither general-purpose nor medical-specific vision-language pretraining produces features suitable for dense multi-label medical imaging classification. For chest X-ray diagnosis, traditional CNNs with ImageNet pretraining remain substantially more effective than VLM-based approaches.
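
To make the protocol concrete, here is a minimal sketch of the frozen-encoder linear-probe setup the abstract describes: a frozen VLM image encoder feeds a trainable linear head optimized with binary cross-entropy over the 14 ChestX-ray14 labels, and macro-F1 is aggregated over 5 random seeds. The 512-d feature size, the 0.5 decision threshold, and the synthetic stand-in data are illustrative assumptions, not the paper's exact configuration.

    # Frozen-encoder linear probe for multi-label classification (sketch).
    # Feature dimension, threshold, and synthetic data are assumptions.
    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.metrics import f1_score

    NUM_LABELS = 14  # ChestX-ray14 disease labels
    FEAT_DIM = 512   # assumed output size of the frozen image encoder

    def macro_f1_for_seed(seed: int) -> float:
        torch.manual_seed(seed)
        rng = np.random.default_rng(seed)

        # Stand-ins for embeddings from a frozen encoder (an image tower
        # with requires_grad_(False)) and for multi-label ground truth.
        feats = torch.randn(2000, FEAT_DIM)
        labels = torch.from_numpy(rng.integers(0, 2, (2000, NUM_LABELS))).float()

        head = nn.Linear(FEAT_DIM, NUM_LABELS)  # the only trainable part
        opt = torch.optim.Adam(head.parameters(), lr=1e-3)
        loss_fn = nn.BCEWithLogitsLoss()        # one sigmoid per label

        for _ in range(200):                    # linear-probe training loop
            opt.zero_grad()
            loss_fn(head(feats), labels).backward()
            opt.step()

        with torch.no_grad():
            preds = (torch.sigmoid(head(feats)) > 0.5).int().numpy()
        return f1_score(labels.int().numpy(), preds, average="macro")

    scores = [macro_f1_for_seed(s) for s in range(5)]
    print(f"macro-F1 = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

One plausible reading of the reported fine-tuning collapse: once the encoder is unfrozen in a setup like this, the heavy imbalance toward "No Finding" in ChestX-ray14 lets BCE be driven down by predicting the negative class everywhere, consistent with the sensitivity drop the abstract describes.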

Matching journals

The top 3 journals account for just over 50% of the predicted probability mass (a short cumulative-sum sketch follows the table).

Rank  Journal                                           Papers  Percentile  Probability
1     Nature Machine Intelligence                       61      Top 0.1%    37.4%
2     Nature Medicine                                   117     Top 0.2%    8.3%
3     Scientific Reports                                3102    Top 15%     6.7%
      -- 50% of predicted probability mass above this line --
4     Nature Communications                             4913    Top 29%     6.3%
5     The Lancet Digital Health                         25      Top 0.1%    6.3%
6     Patterns                                          70      Top 0.2%    3.6%
7     npj Digital Medicine                              97      Top 1%      3.2%
8     Proceedings of the National Academy of Sciences   2130    Top 23%     3.0%
9     Nature Computational Science                      50      Top 0.4%    2.1%
10    JCO Clinical Cancer Informatics                   18      Top 0.5%    1.7%
11    PLOS ONE                                          4510    Top 55%     1.6%
12    eBioMedicine                                      130     Top 2%      1.3%
13    Science Translational Medicine                    111     Top 3%      1.3%
14    Journal of Medical Imaging                        11      Top 0.2%    1.3%
15    Nature Methods                                    336     Top 5%      0.9%
16    Science Advances                                  1098    Top 26%     0.9%
17    Nature Biomedical Engineering                     42      Top 2%      0.7%
18    Nature                                            575     Top 16%     0.7%
19    Communications Medicine                           85      Top 1%      0.7%
20    Communications Biology                            886     Top 27%     0.7%
21    Frontiers in Bioinformatics                       45      Top 1%      0.6%
22    JAMIA Open                                        37      Top 2%      0.6%
23    eLife                                             5422    Top 62%     0.6%
24    Science                                           429     Top 21%     0.6%
25    Modern Pathology                                  21      Top 0.6%    0.6%

"Papers" = number of papers from that journal in the training set.
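
The 50% cutoff marked in the table is just a running sum over the ranked probabilities. A minimal sketch, with the table's top rows hard-coded:

    # How many top-ranked journals cover 50% of the predicted probability
    # mass? Percentages are copied from the table above.
    probs = [37.4, 8.3, 6.7, 6.3, 6.3, 3.6, 3.2, 3.0, 2.1, 1.7]

    cumulative = 0.0
    for rank, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= 50.0:
            print(f"Top {rank} journals cover {cumulative:.1f}% of the mass")
            break
    # -> Top 3 journals cover 52.4% of the mass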