May 18, 2026 | Open Access

Rethinking attention reliability for token pruning in vision transformers

Key Points

The central aim is to evaluate how reliable attention weights are for token pruning in vision transformers (ViTs) across layers, model scales, and training paradigms.
The study systematically analyzes attention selectivity using multiple concentration and stability measures.
It examines non-attention importance scores derived from token embeddings, including statistics- and similarity-based criteria, across ViT variants.
It investigates lightweight static routing configurations for token pruning that require no retraining or dynamic inference control.
Attention scoring is unreliable for early-stage pruning, with performance improving in deeper layers.
Non-attention scores provide more stable pruning behavior in shallow layers compared to attention-based methods.
At equivalent FLOPs, the resulting pruning patterns achieve an accuracy-efficiency trade-off comparable to or better than that of representative token reduction methods.
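
The key points mention concentration measures for attention selectivity without naming them here. As a minimal sketch, assuming normalized Shannon entropy and top-k mass as two plausible concentration measures (both are illustrative choices, not necessarily the metrics used in the paper), a single attention row could be scored as follows. The function names and example rows are hypothetical.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of one attention row, scaled to [0, 1].

    1.0 means perfectly uniform (diffuse) attention; 0.0 means all
    mass sits on a single token. Assumed measure, not from the paper.
    """
    if len(weights) < 2:
        return 0.0  # a single token cannot be diffuse
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def top_k_mass(weights, k):
    """Fraction of attention mass captured by the k largest weights."""
    total = sum(weights)
    return sum(sorted(weights, reverse=True)[:k]) / total


# Illustrative rows: a diffuse one (typical of the shallow-layer behavior
# described above) and a peaked one (more selective attention).
diffuse = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
```

Under these assumptions, a row with normalized entropy close to 1 gives almost no basis for ranking tokens, which is one way to read the claim that diffuse shallow-layer attention is weakly discriminative.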

Abstract

Vision Transformers (ViTs) achieve strong performance in visual recognition but incur quadratic computational cost with respect to the number of tokens, motivating extensive research on token pruning and reduction. Most existing pruning methods estimate token importance directly from attention weights, implicitly assuming that attention magnitude provides a reliable proxy for semantic relevance across all layers. Our analysis shows that the validity of this assumption varies substantially with transformer depth and model scale, and can also depend on the training paradigm. A systematic analysis of attention selectivity using multiple concentration and stability measures shows that attention distributions in shallow layers tend to be highly diffuse and weakly discriminative, making attention-based scoring unreliable for early-stage pruning. In contrast, attention becomes increasingly informative in deeper layers as token representations mature. Motivated by this observation, a broad range of non-attention importance scores derived from token embeddings, including statistics- and similarity-based criteria, is examined. Across ViT variants and diverse training settings, these non-attention scores exhibit more stable pruning behavior in shallow layers, whereas attention-based scoring becomes effective only after sufficient representational discrimination is achieved. Importantly, the depth at which this transition occurs is model-dependent and not strictly monotonic, indicating that uniform attention-based pruning is fundamentally mismatched to the representational dynamics of ViTs. Based on these findings, token pruning is formulated as a layer-wise selection problem governed by the reliability of attention, and lightweight static routing configurations are investigated without retraining or dynamic inference control. For equivalent FLOPs, the resulting pruning patterns achieve a trade-off between accuracy and efficiency that is comparable or superior to that of representative token reduction methods. Overall, these results establish token importance estimation in ViTs as an inherently layer-dependent problem shaped by representation maturity, model characteristics, and training paradigm rather than by uniform attention magnitude.
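
The layer-wise selection idea in the abstract can be illustrated with a small sketch. It assumes a fixed routing table that picks an embedding-based score at a shallow layer and an attention-based score at a deeper one. The table contents, the layer indices, the keep ratios, the choice of L2 norm as the non-attention score, and all function names are hypothetical and are not taken from the paper.

```python
import math


def l2_norm_scores(embeddings):
    """Non-attention score: L2 norm of each token embedding (assumed criterion)."""
    return [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]


def keep_top_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order."""
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical static routing table: layer index -> (scorer name, keep ratio).
# Fixed ahead of time, so no retraining or per-input control is involved.
ROUTING = {
    3: ("embedding", 0.75),  # shallow layer: embedding-based score
    9: ("attention", 0.5),   # deeper layer: class-token attention score
}


def prune_at_layer(layer, embeddings, cls_attention, routing=ROUTING):
    """Select the token indices to keep at `layer` under a static routing table."""
    if layer not in routing:
        return list(range(len(embeddings)))  # no pruning at this layer
    scorer, keep_ratio = routing[layer]
    if scorer == "embedding":
        scores = l2_norm_scores(embeddings)
    else:
        scores = list(cls_attention)
    return keep_top_tokens(scores, keep_ratio)


# Toy inputs: four patch tokens with 2-D embeddings and class-token attention.
embeddings = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0], [0.0, 2.0]]
cls_attention = [0.1, 0.6, 0.2, 0.1]
```

With these toy inputs, `prune_at_layer(3, embeddings, cls_attention)` ranks tokens by embedding norm, while `prune_at_layer(9, embeddings, cls_attention)` ranks them by class-token attention. Because the abstract notes that the transition depth is model-dependent and not strictly monotonic, a real table would have to be chosen per model rather than by a single depth threshold.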
