What question did this study set out to answer?

The study asks how to cut the computational cost of Vision Transformers by pruning along three dimensions together: tokens, attention heads, and FFN neurons.

June 4, 2026 · Open Access

Multi-dimensional collaborative pruning: optimizing vision transformers via token, attention head and FFN neuron

Key Points

The aim is to optimize Vision Transformers by using a multi-dimensional approach to pruning, targeting tokens, attention heads, and FFN neurons.
Introduced an attention-gradient importance analysis method to assess token importance scores.
Applied fusion pruning across layers, guided by the token importance scores.
Proposed a multi-dimensional optimization searcher for pruning strategies of attention heads and FFN neurons.
Achieved a nearly 50% reduction in FLOPs for DeiT-small with only a 1.2% accuracy loss.
Achieved a nearly 44% reduction in FLOPs for DeiT-base with just a 0.9% accuracy loss.
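
To put the FLOPs figures in context, the cost of a transformer layer can be estimated from its token count, the number of attention heads kept, and the number of FFN neurons kept. The sketch below is a back-of-the-envelope model, not the paper's own accounting: the function names, the multiply-accumulate counting convention, and the pruned per-layer configuration are assumptions made for illustration. The dense configuration uses the standard DeiT-small dimensions (12 layers, width 384, 6 heads of size 64, FFN hidden size 1536, 197 tokens for a 224x224 input with 16x16 patches).

```python
def layer_macs(tokens, dim, heads, head_dim, ffn_hidden):
    """Approximate multiply-accumulates for one transformer encoder layer."""
    attn_dim = heads * head_dim                   # width left after head pruning
    qkv = 3 * tokens * dim * attn_dim             # query, key, value projections
    attn = 2 * tokens * tokens * attn_dim         # QK^T plus attention-weighted sum of V
    proj = tokens * attn_dim * dim                # attention output projection
    ffn = 2 * tokens * dim * ffn_hidden           # two FFN linear layers
    return qkv + attn + proj + ffn


def model_macs(layer_cfgs, dim=384, head_dim=64):
    """Sum layer costs; each config is (tokens, heads_kept, ffn_neurons_kept)."""
    return sum(layer_macs(t, dim, h, head_dim, f) for t, h, f in layer_cfgs)


# Dense DeiT-small encoder (patch embedding and classifier head are ignored).
dense = [(197, 6, 1536)] * 12

# Hypothetical pruned schedule, chosen only to illustrate the arithmetic:
# the first 3 layers are untouched, later layers keep fewer tokens, heads, and neurons.
pruned = [(197, 6, 1536)] * 3 + [(138, 5, 1152)] * 9

dense_macs = model_macs(dense)
pruned_macs = model_macs(pruned)
reduction = 1 - pruned_macs / dense_macs
print(f"dense:  {dense_macs / 1e9:.2f} GMACs")
print(f"pruned: {pruned_macs / 1e9:.2f} GMACs ({reduction:.0%} lower)")
```

Because the attention term grows with the square of the token count while the projection and FFN terms grow linearly, token pruning and head or neuron pruning reduce different parts of the total, which is why combining the three dimensions can reach a given budget with less aggressive pruning in each one.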

Abstract

Recently, Vision Transformers (ViTs) have gained increasing attention. Research shows that ViT models perform well on large datasets, yet their high computational cost limits deployment on low-resource devices. Investigations have revealed that, because attention is sparse, only a subset of tokens contributes to the final prediction. Consequently, some existing ViT compression methods reduce the model's burden by eliminating unimportant tokens. Nevertheless, optimizing the model through token sparsification alone is insufficient, and an excessively high token sparsity ratio can cause considerable accuracy degradation. Moreover, the attention head module and the feed-forward network (FFN) module in each layer still account for a considerable proportion of the overall computational load. In response to these challenges, this paper presents a novel pruning method that operates across three dimensions: token, attention head, and FFN neuron. First, an attention-gradient importance analysis method is introduced to quantify the importance scores of tokens, and fusion pruning is then executed based on these scores. In parallel, a multi-dimensional optimization searcher is proposed that identifies the optimal pruning strategies for attention heads and FFN neurons across different layers. This allows the pruning rate to differ from layer to layer, which facilitates hierarchical pruning. By integrating the pruning efforts across these three dimensions, the method achieves efficient model compression while minimizing the degradation of model performance. Experimental results demonstrate that the method significantly reduces the computational cost of ViT models. Specifically, for DeiT-small, it achieves a nearly 50% reduction in FLOPs with only a 1.2% accuracy loss. For DeiT-base, it achieves a nearly 44% reduction in FLOPs with just a 0.9% accuracy loss.
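
The abstract does not give the scoring formula or the fusion rule, so the following is only a minimal sketch of one plausible reading. It assumes that a token's importance is the magnitude of the class token's attention to it multiplied by the gradient of the loss with respect to that attention weight, averaged over heads, and that "fusion" means merging the pruned tokens into a single score-weighted token rather than discarding them. The function names, array shapes, and the random inputs are all hypothetical; in practice the attention maps and gradients would come from a forward and backward pass of the model.

```python
import numpy as np


def token_scores(attn, attn_grad):
    """Score patch tokens from class-token attention and its gradient.

    attn, attn_grad: arrays of shape (heads, tokens, tokens); index 0 is the class token.
    Returns one score per patch token, shape (tokens - 1,).
    """
    cls_to_patches = np.abs(attn[:, 0, 1:] * attn_grad[:, 0, 1:])
    return cls_to_patches.mean(axis=0)  # average over heads


def fuse_prune(x, scores, keep):
    """Keep the `keep` highest-scoring patch tokens and fuse the rest into one token.

    x: array of shape (tokens, dim) with x[0] the class token.
    """
    order = np.argsort(-scores)          # patch indices, highest score first
    kept = np.sort(order[:keep])         # preserve the original spatial order
    dropped = order[keep:]
    patches = x[1:]
    parts = [x[:1], patches[kept]]
    if dropped.size:
        weights = scores[dropped]
        weights = weights / (weights.sum() + 1e-12)
        fused = (weights[:, None] * patches[dropped]).sum(axis=0, keepdims=True)
        parts.append(fused)              # one extra token summarising pruned content
    return np.concatenate(parts, axis=0)


# Random stand-ins for one layer of a DeiT-small-sized model.
rng = np.random.default_rng(0)
heads, tokens, dim = 6, 197, 384
logits = rng.normal(size=(heads, tokens, tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row-wise softmax
attn_grad = rng.normal(size=(heads, tokens, tokens))
x = rng.normal(size=(tokens, dim))

scores = token_scores(attn, attn_grad)
x_pruned = fuse_prune(x, scores, keep=98)
print(x.shape, "->", x_pruned.shape)
```

With `keep=98`, the layer passes on the class token, 98 retained patch tokens, and one fused token, so subsequent layers process 100 tokens instead of 197.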
