Abstract The Vision Transformer (ViT) model has emerged as a powerful architecture for visual tasks by enabling the capture of long-range dependencies within images, demonstrating superior performance across a variety of applications. However, the large parameter count, along with high computational and memory demands of ViTs pose significant challenges. This paper introduces ViT-CAAC (Contribution-Aware Adaptive Compression Framework), a novel, multi-faceted compression framework designed to optimize ViTs. Our framework integrates block-level knowledge distillation, layer-wise quantization with precision control across hierarchical layers, and adaptive sparsity, creating a cohesive approach that substantially reduces model size while preserving performance. Through rigorous experimentation on benchmark datasets, we demonstrate that our framework achieves over 76% reduction in model size with minimal accuracy degradation (less than 0.4% Top-1 accuracy loss). This work establishes a novel concept for deploying high-performance vision models on resource-limited devices, with implications for applications in autonomous systems, IoT, and real-time vision processing.
Zhang et al. (Fri,) studied this question.