May 30, 2024 | Open Access

A Comparative Analysis of Convolutional Neural Network and Vision Transformer Embeddings on a Novel Domain-Specific Task

Abstract

The Vision Transformer (ViT) architecture uses self-attention and the transformer architecture originally designed for natural language processing (NLP), which enables ViTs to capture global relationships and long-range dependencies within images. The purpose of our study was to compare the performance of embeddings generated by a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) on a novel domain-specific task that was not presented to the models at any point before the embeddings were generated. The pretrained CNN model was MobileNetV2, and the pretrained vision transformer was ViT-B16. The accuracy and F-score obtained from the embeddings generated by MobileNetV2 were 0.64 and 0.69, respectively. The accuracy and F-score obtained from the embeddings generated by ViT-B16 were 0.81 and 0.79, respectively. Our study suggests that ViT might perform better on unseen domain-specific problems that were not presented during pretraining. By using self-attention mechanisms, ViT captures rich and generic visual representations that might generalize well to unseen domain-specific problems.
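To make the evaluation idea concrete, the following is a minimal sketch of how fixed embeddings can be scored with accuracy and F-score. It rests on several assumptions: the abstract does not say which classifier was trained on the embeddings or how the F-score was averaged, so a nearest-centroid classifier and a binary F1 are used here purely for illustration. The 2-D vectors are invented stand-ins for MobileNetV2 or ViT-B16 embeddings, not data from the study, and all function names are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each class (nearest-centroid 'training')."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, embeddings):
    """Assign each embedding to the class whose centroid is closest (Euclidean)."""
    return [
        min(centroids, key=lambda label: dist(vec, centroids[label]))
        for vec in embeddings
    ]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy "embeddings" (assumed values, for illustration only).
train_x = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]]
train_y = [0, 0, 1, 1]
test_x = [[0.5, 0.5], [3.5, 4.0], [1.0, 1.0], [5.0, 5.0]]
test_y = [0, 1, 1, 1]

centroids = fit_centroids(train_x, train_y)
preds = predict(centroids, test_x)
acc = accuracy(test_y, preds)
f1 = f1_binary(test_y, preds)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

Running the same scoring functions once per embedding source, with the classifier and data split held fixed, is what makes the two sets of numbers comparable: only the embeddings differ between runs.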

Cite This Study

Nuriel Sahlom Mor studied this question.

synapsesocial.com/papers/68e67a9ab6db6435876049ff https://doi.org/10.21203/rs.3.rs-4496133/v1
