
Pretraining Objective Shapes Cross-Category Generalization in Affective Image Prediction: A Geometric Comparison of Vision Transformer Encoders

Tsuchimoto, S.; Okazaki, Y. O.; Yuasa, K.; Nishijima, S.; Izumiya, M.; Hagihara, M.; Fujihira, R.; Kitajo, K.

2026-05-13 neuroscience
10.64898/2026.05.11.724194 bioRxiv

The geometry of representations learned by deep neural networks is shaped jointly by architecture and pretraining objective, yet disentangling these two factors remains difficult. Here we isolate the contribution of pretraining objective by comparing two Vision Transformers from the same backbone family but trained under different objectives: language-image contrastive learning (CLIP) and ImageNet-21k classification. Using continuous Valence-Arousal prediction on the OASIS dataset as a probe of representational quality, we evaluated frozen features under Leave-One-Theme-Out and Leave-One-Category-Out cross-validation, the latter requiring extrapolation to entirely unseen semantic categories. The contrastively pretrained encoder generalized substantially better than the classification-pretrained encoder under both protocols, with the gap widening sharply when held-out categories required cross-category generalization. To characterize why the two representations differ, we developed a geometric analysis of prediction errors, treating per-image errors as vectors in the affective plane and quantifying their spatial structure via weighted phase-locking, trajectory-based occupancy entropy, and effective dimensionality. The classification-pretrained representation collapsed errors into a small number of attractor regions with a strong center-ward pull, whereas the language-aligned representation distributed errors broadly across the affective space. Layer-wise linear probing further revealed that affective information was distributed across depth in the contrastive encoder but increasingly concentrated in deeper layers of the classification encoder, mirroring the texture-bias and category-anchored statistics characteristic of ImageNet-trained representations. 
These results provide a representation-geometric account of how the choice of pretraining objective, holding architecture constant, determines whether learned features generalize across semantic boundaries or remain confined to category-bound visual regularities.

Highlights

- Isolate the effect of pretraining objective by holding the Vision Transformer backbone constant.
- Contrastively pretrained features generalize across unseen semantic categories where classification-pretrained features fail.
- Introduce a geometric analysis of prediction errors based on phase-locking and occupancy entropy.
- Classification pretraining produces concentrated error attractors and a rigid center-ward bias.
- Affective information is distributed across depth in CLIP but localized in late layers of the classification ViT.
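
The abstract names three error-geometry measures but does not give their formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each per-image error is a 2D vector (predicted minus rated valence and arousal), takes weighted phase-locking to be the magnitude-weighted resultant length of the error directions, takes effective dimensionality to be the participation ratio of the error covariance, and replaces the trajectory-based occupancy entropy with a simpler grid-occupancy entropy. The function names, the 4x4 grid, and the 1 to 7 rating range are all illustrative assumptions.

```python
import math


def weighted_phase_locking(errors):
    """Magnitude-weighted resultant length of error directions, in [0, 1].

    Assumed definition: |sum(e_k)| / sum(|e_k|), with e_k = dx + i*dy.
    Values near 1 mean errors point the same way; near 0 means they cancel.
    """
    z = [complex(dx, dy) for dx, dy in errors]
    total = sum(abs(v) for v in z)
    return abs(sum(z)) / total if total > 0 else 0.0


def effective_dimensionality(errors):
    """Participation ratio of the 2x2 error covariance, in [1, 2].

    Assumed definition: (l1 + l2)^2 / (l1^2 + l2^2) for eigenvalues l1, l2,
    computed as trace(C)^2 / ||C||_F^2 so no eigen-solver is needed.
    """
    n = len(errors)
    mx = sum(e[0] for e in errors) / n
    my = sum(e[1] for e in errors) / n
    sxx = sum((e[0] - mx) ** 2 for e in errors) / n
    syy = sum((e[1] - my) ** 2 for e in errors) / n
    sxy = sum((e[0] - mx) * (e[1] - my) for e in errors) / n
    frob = sxx ** 2 + syy ** 2 + 2 * sxy ** 2
    return (sxx + syy) ** 2 / frob if frob > 0 else 0.0


def occupancy_entropy(points, bins=4, lo=1.0, hi=7.0):
    """Shannon entropy (bits) of point counts over a bins x bins grid.

    Simplified stand-in for the trajectory-based measure; the grid size and
    the [lo, hi] rating range are hypothetical defaults.
    """
    width = (hi - lo) / bins
    counts = {}
    for x, y in points:
        # Clamp indices so points on or beyond the edges land in border cells.
        ix = max(0, min(int((x - lo) / width), bins - 1))
        iy = max(0, min(int((y - lo) / width), bins - 1))
        counts[(ix, iy)] = counts.get((ix, iy), 0) + 1
    n = len(points)
    return -sum(c / n * math.log2(c / n) for c in counts.values()) + 0.0
```

Under these assumed definitions, a representation whose predictions collapse into a few attractor regions would show low occupancy entropy, and a strong shared pull toward the center of the affective plane would show up as high phase-locking once errors are expressed relative to that center.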

Matching journals

The top 4 journals account for 50% of the predicted probability mass.