Vision Transformers (ViTs) have attracted increasing attention, and research shows that they perform well on large datasets. However, their high computational cost limits deployment on resource-constrained devices. Investigations have revealed that, because attention is sparse, only a subset of tokens contributes to the final prediction. Consequently, some existing ViT compression methods reduce the model's cost by eliminating unimportant tokens. Token sparsification alone is nevertheless insufficient: an excessively high token sparsity ratio can cause considerable accuracy degradation, and the attention heads and the feed-forward network (FFN) in each layer still account for a considerable proportion of the overall computational load. To address these challenges, this paper presents a pruning method that operates across three dimensions: tokens, attention heads, and FFN neurons. First, an attention-gradient importance analysis method is introduced to quantify the importance score of each token, and fusion pruning is then performed based on these scores. In parallel, a multi-dimensional optimization searcher is proposed to identify the optimal pruning strategies for attention heads and FFN neurons in each layer. This allows the pruning rate to differ from layer to layer, which facilitates hierarchical pruning. By integrating pruning across these three dimensions, the method achieves efficient model compression while minimizing the degradation of model performance. Experimental results demonstrate that the method significantly reduces the computational cost of ViT models. Specifically, for DeiT-Small, it reduces FLOPs by nearly 50\% with only a 1.2\% accuracy loss. For DeiT-Base, it reduces FLOPs by nearly 44\% with just a 0.9\% accuracy loss.
Liu et al. (Tue,) studied this question.
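
The abstract does not give the exact scoring or fusion formulas, so the following is only a minimal sketch of how attention-gradient token scoring followed by fusion pruning could look. It assumes that a token's score is the sum over heads of the absolute product of its attention weight and the corresponding gradient, and that pruned tokens are fused into a single score-weighted average token. The function names `token_scores` and `fuse_prune` and both formulas are illustrative assumptions, not the paper's definitions.

```python
from typing import List, Sequence, Tuple


def token_scores(attn: Sequence[Sequence[float]],
                 grad: Sequence[Sequence[float]]) -> List[float]:
    """Assumed scoring rule: sum over heads of |attention * gradient|.

    attn[h][j] is the attention weight that head h assigns to token j;
    grad[h][j] is the gradient of the loss with respect to that weight.
    """
    n_tokens = len(attn[0])
    return [sum(abs(a[j] * g[j]) for a, g in zip(attn, grad))
            for j in range(n_tokens)]


def fuse_prune(tokens: Sequence[Sequence[float]],
               scores: Sequence[float],
               keep: int) -> Tuple[List[List[float]], List[int]]:
    """Keep the `keep` highest-scoring tokens and fuse the rest into one.

    Assumed fusion rule: the pruned tokens are replaced by a single token
    that is their score-weighted average, appended after the kept tokens.
    Returns the new token list and the original indices of the kept tokens.
    """
    order = sorted(range(len(tokens)), key=lambda j: scores[j], reverse=True)
    kept_idx = sorted(order[:keep])          # preserve original token order
    dropped_idx = order[keep:]
    kept = [list(tokens[j]) for j in kept_idx]
    if not dropped_idx:
        return kept, kept_idx

    total = sum(scores[j] for j in dropped_idx)
    if total > 0:
        weights = [scores[j] / total for j in dropped_idx]
    else:
        # All pruned tokens have zero score: fall back to a plain average.
        weights = [1.0 / len(dropped_idx)] * len(dropped_idx)

    dim = len(tokens[0])
    fused = [sum(w * tokens[j][d] for w, j in zip(weights, dropped_idx))
             for d in range(dim)]
    return kept + [fused], kept_idx


if __name__ == "__main__":
    # Toy example: 2 heads, 4 tokens, 2-dimensional embeddings.
    attn = [[0.5, 0.1, 0.3, 0.1], [0.2, 0.2, 0.4, 0.2]]
    grad = [[1.0, -1.0, 2.0, 0.0], [1.0, 0.0, -1.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [3.0, 0.0]]
    scores = token_scores(attn, grad)
    pruned, kept_idx = fuse_prune(tokens, scores, keep=2)
    print(kept_idx, pruned)
```

Fusing the pruned tokens instead of discarding them keeps some of their information in the sequence at the cost of one extra token, which is one plausible reading of "fusion pruning"; a per-layer `keep` value would correspond to the layer-specific pruning rates mentioned in the abstract.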