The Vision Transformer (ViT) has been widely adopted in the computer vision community. However, the standard ViT typically has a large parameter count, tends to perform poorly when trained from scratch on medium-scale datasets, and does not explicitly preserve the local spatial and channel-wise structure within each token. This work proposes a novel model, the Token-Shared Convolutional Projection Vision Transformer (TSCP-ViT). The core idea of TSCP-ViT is to integrate convolutional layers into the multi-head attention mechanism and to apply the same convolutional operation independently to each token, where each token retains a 2D spatial, multi-channel structure. In addition, this work introduces a Transformer decoder immediately after each Transformer encoder, enabling the classification tokens to aggregate information from all tokens and be updated using statistical information. This work also proposes a trainable Non-Reversing Gate GELU (NRG-GELU) activation. Comparative experiments on CIFAR-100, Food-101, and ImageNet100 show that, under comparable parameter counts and without pretraining or knowledge distillation, TSCP-ViT substantially surpasses ViT, outperforms CvT, outperforms ResNet on Food-101, and approaches ResNet on CIFAR-100 and ImageNet100, albeit at considerably higher FLOPs.
Zheng et al. (Tue,) studied this question.
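The phrase "apply the same convolutional operation independently to each token" can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`conv2d_same`, `token_shared_projection`), the 3x3 kernel, stride 1, and zero padding are all assumptions chosen for illustration, and the sketch covers only the weight sharing across tokens, not how the projection feeds into multi-head attention. Each token is held as a `[channels][height][width]` nested list, and a single kernel is reused for every token.

```python
from typing import List

Token = List[List[List[float]]]          # [channels][height][width]
Kernel = List[List[List[List[float]]]]   # [out_channels][in_channels][k][k]


def conv2d_same(token: Token, kernel: Kernel) -> Token:
    """Stride-1, zero-padded 2D convolution (cross-correlation) of one token.

    Assumes an odd, square kernel so the output keeps the token's height and width.
    """
    in_ch = len(token)
    h, w = len(token[0]), len(token[0][0])
    k = len(kernel[0][0])
    pad = k // 2
    out: Token = []
    for filt in kernel:                       # one output plane per filter
        plane = [[0.0] * w for _ in range(h)]
        for c in range(in_ch):
            for i in range(h):
                for j in range(w):
                    acc = 0.0
                    for di in range(k):
                        for dj in range(k):
                            y, x = i + di - pad, j + dj - pad
                            if 0 <= y < h and 0 <= x < w:   # zero padding
                                acc += filt[c][di][dj] * token[c][y][x]
                    plane[i][j] += acc
        out.append(plane)
    return out


def token_shared_projection(tokens: List[Token], kernel: Kernel) -> List[Token]:
    """Apply the SAME kernel to every token; no information mixes across tokens."""
    return [conv2d_same(tok, kernel) for tok in tokens]


# Hypothetical toy input: two single-channel 3x3 tokens and one 3x3 box filter.
token_a = [[[1.0] * 3 for _ in range(3)]]
token_b = [[[2.0] * 3 for _ in range(3)]]
box_kernel = [[[[1.0] * 3 for _ in range(3)]]]

projected = token_shared_projection([token_a, token_b], box_kernel)
print(projected[0][0][1][1], projected[1][0][1][1])
```

In a tensor framework the same effect is usually obtained by folding the token axis into the batch axis before a standard 2D convolution, so the parameter count does not grow with the number of tokens; the explicit loop here only makes the sharing visible.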