Abstract The Vision Transformer (ViT) has emerged as a powerful architecture for visual tasks: by capturing long-range dependencies within images, it demonstrates superior performance across a variety of applications. However, the large parameter count and the high computational and memory demands of ViTs pose significant challenges. This paper introduces ViT-CAAC (Contribution-Aware Adaptive Compression Framework), a multi-faceted compression framework designed to optimize ViTs. The framework integrates block-level knowledge distillation, layer-wise quantization with precision control across hierarchical layers, and adaptive sparsity into a cohesive approach that substantially reduces model size while preserving performance. In experiments on benchmark datasets, the framework reduces model size by over 76% with less than 0.4% loss in Top-1 accuracy. This work offers an approach to deploying high-performance vision models on resource-limited devices, with implications for autonomous systems, IoT, and real-time vision processing.
Yu Zhang
Shanxi Agricultural University
Suping Peng
China University of Mining and Technology
Yao Xiao
Guangzhou University of Chinese Medicine
Tsinghua University
Beijing Academy of Artificial Intelligence
Shanghai Artificial Intelligence Laboratory
Zhang et al. (Fri,) studied this question.
synapsesocial.com/papers/68dc1e358a7d58c25ebb1988 — DOI: https://doi.org/10.21203/rs.3.rs-7464053/v1