Transformers have performed exceptionally well in natural language processing (NLP), prompting researchers to study their potential in computer vision. One such model is the Vision Transformer (ViT), which uses a pure transformer architecture to classify an image represented as a sequence of fixed-size patches. However, applying the same patch embedding scheme to every image, regardless of its content, oversimplifies the problem. To address this limitation, we propose a superpixel slicing scheme that dynamically generates a sequence of patch embeddings based on an image’s features. This sequence is then used to build a progressive positional encoding that brings together slices belonging to the same object. Our method integrates seamlessly into the existing Transformer framework to form an end-to-end Vision Transformer with Superpixel Slicing (SSVT). Our empirical results show that SSVT significantly improves the performance of transformer-based models on image classification tasks.
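To make the contrast concrete, the following is a minimal NumPy sketch of the two ways of turning an image into a token sequence: the fixed-size patch grid used by ViT, and a content-dependent grouping of pixels into regions. It is an illustration under stated assumptions, not the SSVT implementation: the function names `fixed_patches` and `region_tokens` are hypothetical, the region label map is assumed to come from some superpixel algorithm that is not shown here, and mean pooling is a stand-in for whatever embedding the actual method applies to each slice.

```python
import numpy as np


def fixed_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, patch * patch * C), i.e. the
    flattened patches that a ViT-style model would project linearly.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * c)


def region_tokens(image, labels):
    """Mean-pool the pixels of each labelled region into one token.

    `labels` is an (H, W) integer map assigning every pixel to a region.
    It is assumed to be produced by a superpixel method (hypothetical here).
    Returns an array of shape (num_regions, C); the sequence length depends
    on the image content rather than on a fixed grid.
    """
    region_ids = np.unique(labels)
    return np.stack([image[labels == r].mean(axis=0) for r in region_ids])


# Toy 4x4 RGB image and a hand-written label map with three regions.
image = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 2, 2],
])

grid_sequence = fixed_patches(image, patch=2)    # shape (4, 12)
region_sequence = region_tokens(image, labels)   # shape (3, 3)
```

With the fixed grid, every image of a given size yields the same number of tokens in the same layout. With the region-based grouping, the number and shape of the slices follow the image content, which is why a positional encoding that relates slices to one another becomes necessary.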
Lu et al. studied this question.