Vision Transformers such as DeiT achieve strong performance on image classification, but their large parameter counts limit deployment on resource-constrained edge devices. Focusing on the fine-grained image classification task (CIFAR-100), this paper investigates structural redundancy in the multi-head attention mechanism of the DeiT-small model. We propose a static structured pruning method that combines an L1-norm importance criterion with a uniform layer-wise constraint. The method scores each attention head statically by the L1 norm of its output projection weights and removes the same number of lowest-scoring heads from every Transformer layer. Because every layer retains some of its heads, this avoids the tensor dimension mismatch that occurs when all heads in a single layer are pruned. Experimental results show that removing 1 of the 6 attention heads in each layer (16.7% of all heads) reduces the parameter count by 5.45% (to 20.52M), with a post-finetuning accuracy of 86.12%. Removing 3 heads per layer (50.0% of all heads) reduces the parameter count by 16.34% (to 18.16M), while accuracy is maintained at 82.08%. These results quantify the redundancy boundary of attention heads in DeiT for fine-grained tasks and provide an empirical reference for model lightweighting.
Chen Siyu (Thu,) studied this question.
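
The head selection and the parameter arithmetic in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard DeiT-small shape (12 layers, 6 heads, embedding width 384), an output projection stored as an (out_features, in_features) matrix in which head h feeds input columns [h*64, (h+1)*64), and that pruning a head removes its Q, K, V weight and bias slices together with its output projection slice. The function names and the 21.70M baseline (inferred from the reported 20.52M and 5.45%) are illustrative assumptions.

```python
import numpy as np

# Assumed DeiT-small shape; not stated explicitly in the abstract.
EMBED_DIM = 384
NUM_HEADS = 6
NUM_LAYERS = 12
HEAD_DIM = EMBED_DIM // NUM_HEADS  # 64


def head_l1_scores(proj_weight, num_heads):
    """L1 norm of each head's slice of an output projection matrix.

    proj_weight has shape (out_features, in_features); head h is assumed
    to own input columns [h * head_dim, (h + 1) * head_dim).
    """
    out_dim, in_dim = proj_weight.shape
    head_dim = in_dim // num_heads
    per_head = np.abs(proj_weight).reshape(out_dim, num_heads, head_dim)
    return per_head.sum(axis=(0, 2))


def heads_to_prune(proj_weights, num_heads, k):
    """Pick the k lowest-scoring heads in every layer (uniform constraint)."""
    if not 0 <= k < num_heads:
        raise ValueError("each layer must keep at least one head")
    plan = []
    for weight in proj_weights:
        scores = head_l1_scores(weight, num_heads)
        lowest = np.argsort(scores, kind="stable")[:k]
        plan.append(sorted(lowest.tolist()))
    return plan


def params_removed(k):
    """Parameters removed when k heads are pruned from every layer.

    Per head: Q, K, V weight slices and the output projection slice
    (4 * EMBED_DIM * HEAD_DIM) plus Q, K, V bias slices (3 * HEAD_DIM).
    The output projection bias is shared by all heads and is kept.
    """
    per_head = 4 * EMBED_DIM * HEAD_DIM + 3 * HEAD_DIM
    return NUM_LAYERS * k * per_head


if __name__ == "__main__":
    # Random stand-ins for the 12 output projection matrices.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((EMBED_DIM, EMBED_DIM))
               for _ in range(NUM_LAYERS)]
    print(heads_to_prune(weights, NUM_HEADS, k=1))

    baseline = 21.70e6  # assumed baseline, inferred from the abstract
    for k in (1, 3):
        removed = params_removed(k)
        print(k, removed, f"{100 * removed / baseline:.2f}%")
```

Under these assumptions, pruning 1 head per layer removes 12 x 98,496 = 1,181,952 parameters and pruning 3 heads removes 3,545,856, which is about 5.45% and 16.34% of a 21.70M baseline, in line with the figures reported above. The check `k < num_heads` encodes the uniform constraint: no layer can lose all of its heads.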