Recently, the Vision Transformer (ViT) has achieved outstanding performance on a range of computer vision tasks. Positional encoding is an indispensable component of ViT because it carries the inherent structural information of images. However, attaching position encodings manually is time-consuming and slows down ViT training. To address this issue, we propose an explicit approach to positional encoding, distinct from the original ViT's implicit design. Our implementation uses a 2D explicit positional encoding that accelerates convergence and improves training efficiency. The improvement is most pronounced in the initial stages of training. The 2D explicit positional encoding also offers improved compatibility with various input lengths and enhanced interpretability. Experimental results on the ImageNet dataset confirm the effectiveness of the approach: the proposed explicit 2D coordinate position encoding achieves an improvement of up to 437%.
Li et al. studied this question.
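The abstract does not spell out how the explicit 2D coordinates are formed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each patch is tagged with its (row, column) grid position scaled to [-1, 1], and that these two values are appended to the patch's feature vector. The function names `grid_coordinates` and `append_coordinates`, the [-1, 1] scaling, and the choice to concatenate rather than add are all hypothetical.

```python
from typing import List, Sequence, Tuple


def grid_coordinates(height: int, width: int) -> List[Tuple[float, float]]:
    """Return (y, x) for every patch in row-major order, scaled to [-1, 1].

    Assumption: normalized grid coordinates stand in for the paper's
    "explicit 2D coordinate" encoding; the actual formulation may differ.
    """
    def scale(index: int, size: int) -> float:
        # A single row or column is placed at the centre of the range.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    return [
        (scale(row, height), scale(col, width))
        for row in range(height)
        for col in range(width)
    ]


def append_coordinates(
    tokens: Sequence[Sequence[float]], height: int, width: int
) -> List[List[float]]:
    """Append the (y, x) coordinate of each patch to its feature vector.

    Assumption: concatenation is used here for readability; adding a
    projected coordinate to the embedding would be another option.
    """
    if len(tokens) != height * width:
        raise ValueError("expected one token per patch in the grid")
    coords = grid_coordinates(height, width)
    return [list(token) + [y, x] for token, (y, x) in zip(tokens, coords)]
```

Because the coordinates are computed from the grid shape rather than stored in a fixed-size table, the same function applies to any number of patches, which is one way to read the abstract's remark about compatibility with various input lengths. For example, `grid_coordinates(2, 2)` and `grid_coordinates(14, 14)` both span the same [-1, 1] range.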