Cross-Attention Enables Context-Aware Multimodal Skin Lesion Diagnosis
Mridha, K.; Islam, H.
Clinical diagnosis of skin lesions integrates visual dermoscopic features with patient context such as age, skin type, and lesion characteristics. However, most artificial intelligence systems for dermoscopic analysis rely solely on image data and ignore structured clinical metadata. We developed a multimodal deep learning framework that combines dermoscopic images with patient metadata and evaluated whether cross-attention mechanisms better capture contextual interactions than conventional fusion strategies. Using 1,568 lesions from the PAD-UFES-20 dataset (69% malignant) with associated metadata (age, sex, Fitzpatrick skin type, anatomical site, and lesion diameter), we compared four models: metadata-only logistic regression, image-only ResNet18, late fusion via feature concatenation, and cross-attention-based fusion. The image-only model achieved strong discrimination (AUC 0.9776), while late fusion slightly reduced performance (AUC 0.9717). The proposed cross-attention model achieved the best overall results (AUC 0.9818, AUPRC 0.9924) with improved calibration (ECE 0.0379). These findings suggest that attention-based multimodal learning enables more effective integration of patient context for automated skin lesion diagnosis.
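The cross-attention fusion described above can be sketched as scaled dot-product attention in which a patient-metadata embedding queries a set of image feature vectors, so the pooled visual representation is conditioned on clinical context. The following is a minimal illustrative sketch, not the authors' implementation; the function name, the single-query formulation, and the toy dimensions are assumptions for clarity.

```python
import math

def cross_attention(query, keys, values):
    """Scaled dot-product cross-attention for one query vector.

    query:  metadata embedding (list of floats, dim d) -- hypothetical stand-in
            for the encoded patient context (age, sex, skin type, site, size).
    keys/values: image feature vectors (list of lists, each dim d), e.g.
            spatial features from a CNN backbone such as ResNet18.
    Returns a context-conditioned weighted sum of the image features.
    """
    d = len(query)
    # Attention scores: query . key / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the image features
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of image features, conditioned on patient context
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(d)]
```

In this toy setting, a metadata query aligned with one image feature up-weights that feature: `cross_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` puts roughly twice the attention weight on the first feature vector.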