Abstract The Vision Transformer (ViT) architecture uses self-attention and the transformer architecture originally designed for natural language processing (NLP), which enables ViTs to capture global relationships and long-range dependencies within images. The purpose of our study was to compare the performance of embeddings generated by a pretrained Convolutional Neural Network (CNN) and a pretrained Vision Transformer (ViT) on a novel domain-specific task that had never been presented to the models before the embeddings were generated. The pretrained CNN model was MobileNetV2, and the pretrained vision transformer was ViT-B16. The accuracy and F-score obtained from the embeddings generated by MobileNetV2 were 0.64 and 0.69, respectively. The accuracy and F-score obtained from the embeddings generated by ViT-B16 were 0.81 and 0.79, respectively. Our study suggests that ViT might perform better on unseen domain-specific problems that were not part of the pretraining. By using self-attention mechanisms, ViT captures rich and generic visual representations that might generalize well to unseen domain-specific problems.
Nuriel Sahlom Mor (Thu,) studied this question.
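The abstract compares the two embedding models by accuracy and F-score. As a minimal sketch of how these two metrics are commonly computed from a classifier's predictions, the Python below assumes a binary task with labels 0 and 1 and the F1 variant of the F-score; the abstract states neither, and the function names and toy labels are hypothetical and not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class.

    Assumption: binary F1; the abstract does not say which F-score variant
    or averaging scheme was used.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, standing in for a classifier trained on embeddings.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_f1 = f1_score(toy_true, toy_pred)
```

Because accuracy counts every correct prediction while F1 only rewards correct positives, the two can rank models differently on imbalanced data, which is one reason to report both as the abstract does.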