Vision Transformer (ViT) demonstrates the potential of the transformer architecture in computer vision. However, the computational complexity of self-attention grows quadratically with the length of the input sequence, which limits the application of transformers to high-resolution images. To improve the overall performance of the Vision Transformer, this paper proposes MLVT, an efficient Vision Transformer with dynamic embedding of multi-scale features. MLVT adopts a pyramid architecture and replaces the self-attention operation with linear self-attention. Because linear self-attention disperses attention scores and ignores local correlation, a local attention enhancement module is proposed, which uses a convolution operation with self-attention-like computation to supplement local attention. Because the feature dimension increases in the pyramid architecture, the bottleneck of linear self-attention computation shifts from sequence length to feature dimension, so a linear self-attention with compressed feature dimension is proposed. In addition, since multi-scale inputs are crucial for processing image information, this paper proposes a flexible and learnable dynamic multi-scale feature embedding module, which dynamically adjusts the weights of features at different scales according to the input image before fusing them. Extensive experiments on image classification and object detection tasks show that MLVT achieves competitive results while reducing computational cost.
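The abstract does not give the exact form of the linear self-attention used in MLVT, so the following is only a minimal sketch of the general kernelized linear-attention idea it builds on. The feature map `elu(x) + 1` and all function names are assumptions chosen for illustration, not the authors' formulation. The sketch shows why reordering the matrix products removes the N x N score matrix and leaves a d x d summary, which is the sense in which the bottleneck moves from sequence length to feature dimension.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def feature_map(m):
    """Positive feature map elu(x) + 1 (an assumption, not taken from the paper)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def quadratic_kernel_attention(q, k, v):
    """Reference version: builds the full N x N weight matrix, cost O(N^2 * d)."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))                      # N x N
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def linear_attention(q, k, v):
    """Same result, but computes phi(K)^T V first, cost O(N * d^2)."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)                           # d x d_v summary
    k_sum = [sum(col) for col in zip(*fk)]                  # length-d normalizer
    out = []
    for row in fq:
        norm = sum(a * b for a, b in zip(row, k_sum))       # scalar per query
        numer = matmul([row], kv)[0]                        # 1 x d_v
        out.append([x / norm for x in numer])
    return out
```

Both functions return the same values up to floating-point error; only the order of multiplication differs. When the feature dimension d grows in the deeper stages of a pyramid, the d x d summary becomes the dominant cost, which is the motivation the abstract gives for compressing the feature dimension.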
Mingrui Zhang
Ronggui Wang
Juan Yang
Hefei University of Technology
Zhang et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68e5bfacb6db643587557866 — DOI: https://doi.org/10.1117/12.3035520