What question did this study set out to answer?

This work aims to improve the performance of Vision Transformers on medium-scale datasets while preserving spatial and channel-wise structures.

June 26, 2026 · Open Access

Vision Transformer with Spatial 2D Multi-Channel Tokens

Key points

Developed the Token-Shared Convolutional Projection Vision Transformer (TSCP-ViT) model.
Integrated convolutional layers into the multi-head attention mechanism, applying the same convolution to every token.
Introduced a Transformer decoder after each Transformer encoder and a trainable Non-Reversing Gate GELU (NRG-GELU) activation.
TSCP-ViT substantially surpassed ViT and outperformed CvT on CIFAR-100, Food-101, and ImageNet100.
It outperformed ResNet on Food-101 and approached ResNet performance on CIFAR-100 and ImageNet100.
These results were achieved with comparable parameter counts and without pretraining or knowledge distillation, but with considerably higher FLOPs.
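
The last point, comparable parameters but higher FLOPs, is consistent with a general property of sharing one convolution across tokens: the weight count does not depend on the number of tokens, while the multiply-accumulate count grows with it. The sketch below is only a back-of-the-envelope illustration of that property. It is not the paper's own accounting, and the function names and layer sizes are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of one k x k convolution (hypothetical layer)."""
    return k * k * c_in * c_out + c_out


def conv_macs_per_token(c_in, c_out, k, h, w):
    """Multiply-accumulates for one token of spatial size h x w,
    assuming stride 1 and 'same' padding (an assumption, not from the paper)."""
    return k * k * c_in * c_out * h * w


def shared_conv_cost(num_tokens, c_in, c_out, k, h, w):
    """Cost of applying the SAME convolution to every token."""
    return {
        "params": conv_params(c_in, c_out, k),  # independent of num_tokens
        "macs": num_tokens * conv_macs_per_token(c_in, c_out, k, h, w),
    }


# Illustrative sizes only: 32 channels, 3x3 kernel, 8x8 tokens.
cost_16 = shared_conv_cost(num_tokens=16, c_in=32, c_out=32, k=3, h=8, w=8)
cost_64 = shared_conv_cost(num_tokens=64, c_in=32, c_out=32, k=3, h=8, w=8)
print(cost_16)
print(cost_64)
```

With these made-up sizes, quadrupling the number of tokens leaves the parameter count unchanged and quadruples the multiply-accumulates. Whether this effect dominates the FLOPs reported for TSCP-ViT is not stated in this summary.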

Abstract

The Vision Transformer (ViT) has been widely adopted in the computer vision community. However, the standard ViT often contains many parameters, usually performs poorly when trained from scratch on medium-scale datasets, and does not explicitly preserve the local spatial and channel-wise structures within each token. This work proposes a novel model, the Token-Shared Convolutional Projection Vision Transformer (TSCP-ViT). Its core idea is to integrate convolutional layers into the multi-head attention mechanism and to apply the same convolutional operation independently to each token, where each token is a spatial 2D multi-channel feature map. In addition, the work introduces a Transformer decoder immediately after each Transformer encoder, enabling the classification tokens to aggregate information from all tokens and to be updated using statistical information. It also proposes a trainable Non-Reversing Gate GELU (NRG-GELU) activation. Comparative experiments on CIFAR-100, Food-101, and ImageNet100 show that, under comparable parameter counts and without pretraining or knowledge distillation, TSCP-ViT substantially surpasses ViT, outperforms CvT, outperforms ResNet on Food-101, and approaches ResNet on CIFAR-100 and ImageNet100, at the cost of considerably higher FLOPs.
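
The core idea in the abstract, one convolution shared by all tokens inside attention, can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `conv2d_same`, `token_shared_projection`, and `attention_head` are hypothetical, the kernel size and tensor shapes are made up, and computing attention scores by flattening each projected token is an assumption, since this summary does not say how TSCP-ViT compares tokens.

```python
import numpy as np


def conv2d_same(x, weight, bias):
    """Stride-1, zero-padded 'same' convolution (cross-correlation) for one token.
    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k; bias: (C_out,)."""
    c_out, _, k, _ = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], patch)
    return out + bias[:, None, None]


def token_shared_projection(tokens, weight, bias):
    """Apply the SAME convolution independently to every token.
    tokens: (N, C_in, H, W) -> (N, C_out, H, W)."""
    return np.stack([conv2d_same(t, weight, bias) for t in tokens])


def attention_head(tokens, params):
    """Single attention head whose Q/K/V projections are shared convolutions.
    Assumption: scores are dot products of flattened projected tokens."""
    q = token_shared_projection(tokens, *params["q"])
    k = token_shared_projection(tokens, *params["k"])
    v = token_shared_projection(tokens, *params["v"])
    n = tokens.shape[0]
    qf, kf = q.reshape(n, -1), k.reshape(n, -1)
    scores = qf @ kf.T / np.sqrt(qf.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    # Mix value tokens while keeping each token's 2D multi-channel layout.
    out = np.einsum("nm,mchw->nchw", attn, v)
    return out, attn


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 3, 8, 8))  # 4 tokens, 3 channels, 8x8 each


def make_conv(c_in, c_out, k=3):
    return rng.normal(scale=0.1, size=(c_out, c_in, k, k)), np.zeros(c_out)


params = {name: make_conv(3, 3) for name in ("q", "k", "v")}
out, attn = attention_head(tokens, params)
print(out.shape, attn.shape)
```

The output keeps the shape of the input, so each token remains a small multi-channel image rather than a flat vector, which is the structural property the abstract emphasizes. The decoder stage and the NRG-GELU activation are left out because this summary does not specify them.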

Cite This Study

Zheng et al. studied this question.

synapsesocial.com/papers/6a3e17d3030ad1a9b30912e3
https://doi.org/10.3390/electronics15132752
