Leveraging the ability of Vision Transformers (ViTs) to model contextual information across spatial patches, Masked Image Modeling (MIM) has emerged as a successful pre-training paradigm for visual representation learning by masking parts of the input and reconstructing the original image. However, this characteristic of ViTs has led many existing MIM methods to focus primarily on spatial patch reconstruction, overlooking the importance of semantic continuity in the channel dimension. Therefore, we propose a novel Masked Channel Modeling (MCM) pre-training paradigm, which reconstructs masked channel features using the contextual information from unmasked channels, thereby enhancing the model’s understanding of images from the perspective of channel semantic continuity. Considering that traditional RGB reconstruction targets lack sufficient semantic attributes in the channel dimension, MCM introduces advanced features extracted by the CLIP image encoder as reconstruction targets. This guides the model to better capture semantic continuity across feature channels. Extensive experiments on downstream tasks, including image classification, object detection, and semantic segmentation, demonstrate the effectiveness and superiority of MCM. Our code will be available later.
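To make the channel-masking idea concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes features are stored as a list of tokens, each a list of per-channel values. The function names, the 50% mask ratio, the zero fill value, and the choice to compute the loss on masked channels only are illustrative assumptions. The target tensor is a random stand-in for features that would come from a CLIP image encoder in the described setup.

```python
import random


def sample_channel_mask(num_channels, mask_ratio, rng):
    """Return a list of booleans; True marks a masked channel.

    The same channel mask is shared by all tokens (an assumption).
    """
    num_masked = int(round(num_channels * mask_ratio))
    masked = set(rng.sample(range(num_channels), num_masked))
    return [c in masked for c in range(num_channels)]


def apply_channel_mask(features, mask, fill=0.0):
    """Replace the masked channels of every token with a fill value.

    A learnable mask token could be used instead; zero is a placeholder.
    """
    return [[fill if m else v for v, m in zip(token, mask)] for token in features]


def masked_channel_loss(pred, target, mask):
    """Mean squared error computed over masked channels only."""
    total, count = 0.0, 0
    for pred_token, target_token in zip(pred, target):
        for p, t, m in zip(pred_token, target_token, mask):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


rng = random.Random(0)
num_tokens, num_channels = 4, 8

# Hypothetical token features (num_tokens x num_channels).
features = [[rng.random() for _ in range(num_channels)] for _ in range(num_tokens)]
# Random stand-in for the reconstruction target (CLIP features in the paper).
target = [[rng.random() for _ in range(num_channels)] for _ in range(num_tokens)]

mask = sample_channel_mask(num_channels, 0.5, rng)
masked_input = apply_channel_mask(features, mask)

# A real model would predict the masked channels from the unmasked ones;
# here the masked input itself serves as a trivial placeholder prediction.
loss = masked_channel_loss(masked_input, target, mask)
```

In an actual pipeline, `masked_input` would be passed through the encoder and a reconstruction head, and the resulting prediction would replace the placeholder in the loss call.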