Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also results in increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every image has its own characteristics, and the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of "easy" images that can be accurately predicted with as few as 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. Inspired by this, we propose a Dynamic Transformer to automatically configure a suitable number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across the components of the Dynamic Transformer to reduce redundant computation. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed. Code and pre-trained models (based on PyTorch and MindSpore) are available at https://github.com/blackfeather-wang/Dynamic-Vision-Transformer and https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore.
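The adaptive inference procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `cascade_predict`, the toy stand-in models `coarse_model` and `fine_model`, and the threshold values are all hypothetical, and the feature reuse and relationship reuse mechanisms are omitted. It only shows the early-exit idea, assuming that confidence is measured as the maximum softmax probability.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "model" here is any callable mapping an input to a list of class logits.
# In the paper's setting, each stage would be a Transformer using more tokens
# than the previous one; here they are toy stand-ins (an assumption).
Model = Callable[[object], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(
    image: object,
    models: Sequence[Model],
    thresholds: Sequence[float],
) -> Tuple[int, int, float]:
    """Run models from cheapest to most expensive and stop early.

    Returns (predicted_label, index_of_exit_stage, confidence).
    The final stage always produces the answer if no earlier stage
    is confident enough.
    """
    if not models or len(models) != len(thresholds):
        raise ValueError("need one threshold per model, and at least one model")

    label, confidence = 0, 0.0
    for stage, (model, threshold) in enumerate(zip(models, thresholds)):
        probs = softmax(model(image))
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            # Sufficiently confident: terminate inference at this stage.
            return label, stage, confidence
    # No stage met its threshold: fall back to the last stage's prediction.
    return label, len(models) - 1, confidence


# Hypothetical toy models: the coarse one is confident only on "easy" inputs.
def coarse_model(image: object) -> List[float]:
    return [6.0, 0.5, 0.1] if image == "easy" else [2.0, 1.5, 0.1]


def fine_model(image: object) -> List[float]:
    return [0.2, 5.0, 0.1]


if __name__ == "__main__":
    for sample in ("easy", "hard"):
        print(sample, cascade_predict(sample, [coarse_model, fine_model], [0.9, 0.0]))
```

In this sketch the "easy" input exits at the first (cheap) stage, while the "hard" input falls through to the second stage. How the thresholds should be chosen in practice is not stated in the abstract; the values above are placeholders.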
Wang et al. studied this question.