Abstract The vision transformer (ViT), when pre-trained on large datasets, outperforms convolutional neural networks (CNNs) in computer vision (CV). Without pre-training, however, the transformer architecture performs poorly on small datasets and is surpassed by CNNs. Through analysis, we found that: (1) the division and processing of tokens in the ViT discard the marginal information between tokens; (2) the isolated multi-head self-attention (MSA) lacks prior knowledge; and (3) the local inductive bias of stacked transformer blocks is much weaker than that of CNNs. We propose a novel architecture for small-data paradigms without pre-training, named Add-Vit, which uses progressive tokenization with feature supplementation in patch embedding. The model's representational ability is enhanced by a convolutional prediction module shortcut that connects the MSA and captures local features as additional representations of the tokens. Without pre-training on large datasets, our best model achieved 81.25% accuracy when trained from scratch on CIFAR-100.
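Finding (1) can be illustrated with a small, self-contained sketch. It is not the Add-Vit tokenizer, whose details the abstract does not give. It only counts, along one image axis, how many neighboring pixel pairs never fall inside a common patch. The function names, the patch size of 4, and the strides are assumptions chosen for illustration; the axis length of 32 matches the CIFAR image size.

```python
def patch_starts(size, patch, stride):
    """Start offsets of patches along one axis (no padding; hypothetical helper)."""
    return list(range(0, size - patch + 1, stride))


def split_neighbor_pairs(size, patch, stride):
    """Count adjacent pixel pairs (i, i + 1) along one axis that share no patch."""
    starts = patch_starts(size, patch, stride)
    split = 0
    for i in range(size - 1):
        # A pair stays together if some patch [s, s + patch) covers both pixels.
        shared = any(s <= i and i + 1 < s + patch for s in starts)
        if not shared:
            split += 1
    return split


if __name__ == "__main__":
    size, patch = 32, 4  # assumed: 32-pixel axis, 4-pixel patches
    # Non-overlapping patches (stride == patch), as in a plain ViT patch embedding.
    print("non-overlapping:", split_neighbor_pairs(size, patch, stride=4))
    # Overlapping patches (stride < patch), one common way to keep boundary context.
    print("overlapping:", split_neighbor_pairs(size, patch, stride=2))
```

With non-overlapping patches, every pair that straddles a patch boundary is separated before any attention is computed. With overlapping windows, each such pair appears together in at least one token, which is one plausible reading of what a progressive tokenization scheme aims to preserve.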
Chen et al. (Fri,) studied this question.