March 18, 2024 · Open Access

Vision Transformer with 2D Explicit Position Encoding

Abstract

The Vision Transformer (ViT) has recently achieved outstanding performance in a range of computer vision tasks. Positional encoding is an indispensable component of ViT because it carries the inherent structural information of images. However, attaching position encodings manually is time-consuming and slows ViT training. To address this issue, we propose an explicit approach to positional encoding, distinct from the original ViT's implicit design. Our implementation uses a 2D explicit positional encoding method that accelerates convergence and improves training efficiency. The improvement is most pronounced in the initial stages of training, and the 2D explicit positional encoding also offers better compatibility with varying input lengths and enhanced interpretability. Experimental results on the ImageNet dataset confirm the effectiveness of the proposed approach: the explicit 2D coordinate position encoding achieves an improvement of up to 437%.
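The abstract does not give the encoding's formula, so the following is only an illustrative sketch of one plausible reading of "explicit 2D coordinate position encoding": each patch is tagged with its normalized (row, column) coordinates, which are then concatenated onto the patch embedding. The function names `patch_coordinates` and `attach_coordinates`, the centre-of-patch normalization, and the use of concatenation are all assumptions made for illustration, not the authors' method.

```python
def patch_coordinates(height, width, patch_size):
    """Return one normalized (y, x) pair per patch, in row-major order.

    Hypothetical helper: the normalization below is an assumption,
    not taken from the paper.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    coords = []
    for r in range(rows):
        for c in range(cols):
            # Patch centre mapped into (0, 1), so the values stay in the
            # same range whatever the number of patches.
            coords.append(((r + 0.5) / rows, (c + 0.5) / cols))
    return coords


def attach_coordinates(patch_embeddings, coords):
    """Concatenate each patch's (y, x) coordinates onto its embedding."""
    if len(patch_embeddings) != len(coords):
        raise ValueError("need exactly one coordinate pair per patch")
    return [list(e) + [y, x] for e, (y, x) in zip(patch_embeddings, coords)]


# A 224x224 image with 16x16 patches gives a 14x14 grid of patches.
coords = patch_coordinates(224, 224, 16)
tokens = attach_coordinates([[0.0] * 8 for _ in coords], coords)
print(len(tokens), len(tokens[0]))
```

Because the coordinates are computed from the grid shape rather than looked up in a fixed-size table, the same function applies to any patch grid, which is one way an explicit encoding could accommodate varying input lengths.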

Cite This Study

Li et al. (2024) studied this question.

synapsesocial.com/papers/68e7398bb6db6435876b2cc5 https://doi.org/10.1109/icassp48485.2024.10446293
