Vision Transformers (ViTs) achieve strong performance in visual recognition but incur computational cost that grows quadratically with the number of tokens, motivating extensive research on token pruning and reduction. Most existing pruning methods estimate token importance directly from attention weights, implicitly assuming that attention magnitude provides a reliable proxy for semantic relevance across all layers. Our analysis shows that the validity of this assumption varies substantially with transformer depth and model scale, and can also depend on the training paradigm. A systematic analysis of attention selectivity using multiple concentration and stability measures shows that attention distributions in shallow layers tend to be highly diffuse and weakly discriminative, making attention-based scoring unreliable for early-stage pruning. In contrast, attention becomes increasingly informative in deeper layers as token representations mature. Motivated by this observation, a broad range of non-attention importance scores derived from token embeddings, including statistics- and similarity-based criteria, is examined. Across ViT variants and diverse training settings, these non-attention scores exhibit more stable pruning behavior in shallow layers, whereas attention-based scoring becomes effective only after sufficient representational discrimination is achieved. Importantly, the depth at which this transition occurs is model-dependent and not strictly monotonic, indicating that uniform attention-based pruning is fundamentally mismatched to the representational dynamics of ViTs. Based on these findings, token pruning is formulated as a layer-wise selection problem governed by the reliability of attention, and lightweight static routing configurations that require neither retraining nor dynamic inference control are investigated.
At equivalent FLOPs, the resulting pruning patterns achieve an accuracy-efficiency trade-off comparable or superior to that of representative token reduction methods. Overall, these results establish token importance estimation in ViTs as an inherently layer-dependent problem shaped by representation maturity, model characteristics, and training paradigm rather than by uniform attention magnitude.
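The layer-wise selection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the concentration measures, the embedding statistics, or the routing layers, so everything below is an assumption chosen for illustration. Normalized Shannon entropy stands in for a concentration measure, the L2 norm of each token embedding stands in for a statistics-based non-attention score, and the `ROUTING` table, `keep_ratio`, and all function names are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of an attention distribution, scaled to [0, 1].
    Values near 1 mean diffuse attention; values near 0 mean concentrated."""
    n = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n) if n > 1 else 0.0

def attention_scores(cls_attention):
    """Attention-based importance: attention from the class token to each patch token."""
    return list(cls_attention)

def norm_scores(embeddings):
    """Non-attention importance (assumed statistic): L2 norm of each token embedding."""
    return [math.sqrt(sum(x * x for x in e)) for e in embeddings]

def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical static routing table: layer index -> scoring rule.
# Shallow layers use the embedding statistic, deeper layers use attention.
ROUTING = {3: "norm", 9: "attention"}

def prune_layer(layer, embeddings, cls_attention, keep_ratio=0.5):
    """Return indices of tokens kept at this layer under the static routing table."""
    rule = ROUTING.get(layer)
    if rule is None:
        # No pruning is scheduled at this layer, so every token is kept.
        return list(range(len(embeddings)))
    if rule == "norm":
        scores = norm_scores(embeddings)
    else:
        scores = attention_scores(cls_attention)
    k = max(1, math.ceil(keep_ratio * len(scores)))
    return keep_top_k(scores, k)

# Toy data: four patch tokens with 2-dimensional embeddings.
embeddings = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]]
diffuse = [0.25, 0.25, 0.25, 0.25]   # shallow-layer-like attention
peaked = [0.1, 0.6, 0.1, 0.2]        # deeper-layer-like attention

shallow_keep = prune_layer(3, embeddings, diffuse)   # ranks tokens by embedding norm
deep_keep = prune_layer(9, embeddings, peaked)       # ranks tokens by attention
```

In this toy setting the diffuse distribution has maximal normalized entropy, so ranking by attention at layer 3 would be decided by tie-breaking alone, while the embedding norm still separates the tokens. A fixed table like `ROUTING` requires no retraining and no per-input decision at inference time, which is the sense in which the routing is static.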
Ryuto Ishibashi
Hayata Kaneko
Lin Meng
Neurocomputing
Ritsumeikan University
Ishibashi et al. studied this question.
www.synapsesocial.com/papers/6a0aac6d5ba8ef6d83b6fd86 — DOI: https://doi.org/10.1016/j.neucom.2026.133976