This paper explores the use of vision transformers (ViTs) for on-edge medical diagnostics through experiments on Kvasir-Capsule, a large-scale image classification dataset of gastrointestinal diseases. Quantization techniques available through TensorFlow Lite (TFLite), including post-training float-16 (F16) quantization and quantization-aware training (QAT), are applied to reduce model size without compromising performance. The seven ViT models selected for this study are EfficientFormerV2S2, EfficientViTB0, EfficientViTM4, MobileViTV2_050, MobileViTV2_100, MobileViTV2_175, and RepViTM11. Three metrics are considered when analyzing a model: (i) F1-score, (ii) model size, and (iii) performance-to-size ratio, where performance is the F1-score and size is the model size in megabytes (MB). In terms of F1-score, we show that MobileViTV2_175 with F16 quantization outperforms all other models, with an F1-score of 0.9534. On the other hand, MobileViTV2_050 trained using QAT was scaled down to a model size of 1.70 MB, making it the smallest of the variants examined in this paper. MobileViTV2_050 also achieved the highest performance-to-size ratio, 41.25. Although smaller models are preferable for latency and memory reasons, medical diagnostics cannot afford poorly performing models. We conclude that MobileViTV2_175 with F16 quantization is our best-performing model, with a small size of 27.47 MB, providing a benchmark for lightweight models on the Kvasir-Capsule dataset.
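The performance-to-size ratio defined above can be illustrated with a minimal Python sketch. The function name `performance_to_size` and its `as_percent` flag are hypothetical and are not taken from the paper. The sketch assumes the F1-score is expressed in percent before dividing by size: with the smallest model at 1.70 MB and an F1-score of at most 1, a ratio on the fractional scale could not exceed roughly 0.59, so the reported 41.25 appears to use the percentage scale. The only inputs used are the figures quoted in the abstract.

```python
def performance_to_size(f1: float, size_mb: float, as_percent: bool = True) -> float:
    """Return the F1-score divided by the model size in MB.

    as_percent=True scales F1 to a percentage first (an assumption about
    the paper's convention, inferred from the reported ratio of 41.25).
    """
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    score = f1 * 100.0 if as_percent else f1
    return score / size_mb


# Upper bound on the fractional scale: a perfect F1 at the smallest size (1.70 MB).
fractional_upper_bound = performance_to_size(1.0, 1.70, as_percent=False)

# Ratio for the best-F1 model in the abstract: MobileViTV2_175 with F16 quantization.
best_f1_ratio = performance_to_size(0.9534, 27.47)

print(f"fractional upper bound: {fractional_upper_bound:.3f}")
print(f"MobileViTV2_175 (F16) ratio: {best_f1_ratio:.2f}")
```

Under this assumption, the model with the highest F1-score has a much lower ratio than the 41.25 reported for MobileViTV2_050, which reflects the trade-off the abstract describes between accuracy and size.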
Varam et al. studied this question.