Transformers have achieved compelling performance in vision tasks, but their substantial computational overhead remains a major obstacle to efficient real-time visual perception, especially on resource-constrained platforms such as unmanned aerial vehicles and embedded systems. Existing acceleration strategies mostly reduce attention complexity or incorporate lightweight convolutional operations; less attention has been paid to how stage configuration, early embedding design, and potential late-stage attention redundancy jointly affect the efficiency of compact Vision Transformers. To address this gap, we empirically compare two common architectural configurations and investigate whether a three-stage design can improve inference efficiency while preserving representation quality. Based on this analysis, we propose Multi-Level Patch Embedding (MLPE), a progressive early embedding structure that goes beyond simple patchification and conventional convolutional stems. MLPE decomposes the stem-to-token transformation into three steps: preliminary downsampling, intermediate local feature refinement, and final compact embedding. This helps retain and refine fine-grained local structures before compact token representations are generated. Our empirical results further suggest that, within the evaluated compact architectures, allocating greater depth to the final stage and concentrating attention layers there tends to improve the accuracy-efficiency balance. To reduce the resulting attention overhead, we propose Squeeze Multi-Head Self-Attention (SMSA), which learns a compact representative attention response in a squeezed feature space and fuses the re-scaled attention output with the input feature for feature recovery. Built on these designs, the resulting model, SFViT, achieves a favorable accuracy-efficiency trade-off on ImageNet-1K and COCO, suggesting its practical potential for efficient real-time visual perception.
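To make the motivation for SMSA concrete, the following back-of-the-envelope sketch compares the approximate multiply-accumulate (MAC) cost of a standard self-attention layer with that of a squeezed variant. It is an illustration under stated assumptions, not the SFViT implementation: it assumes the squeeze is a channel reduction by a hypothetical factor `squeeze_ratio`, implemented as one linear projection into the squeezed space and one back out, and it ignores the softmax, the re-scaling, and the fusion with the input feature. The function names and the example sizes (49 tokens, 384 channels, ratio 4) are likewise assumptions chosen for illustration.

```python
def mhsa_macs(num_tokens: int, channels: int) -> int:
    """Approximate MACs of one standard multi-head self-attention layer.

    The head count does not change this total, because the heads
    partition the channel dimension.
    """
    # Q, K, V and output projections: four (channels x channels) linear maps.
    projections = 4 * num_tokens * channels * channels
    # QK^T attention scores plus the attention-weighted sum over V.
    attention = 2 * num_tokens * num_tokens * channels
    return projections + attention


def squeezed_mhsa_macs(num_tokens: int, channels: int, squeeze_ratio: int) -> int:
    """Approximate MACs when attention runs in a channel-squeezed space.

    ASSUMPTION: the squeeze is a linear channel reduction by `squeeze_ratio`,
    followed by attention in the reduced space and a linear expansion back.
    The abstract does not specify this; it is a hypothetical design.
    """
    squeezed = channels // squeeze_ratio
    squeeze = num_tokens * channels * squeezed   # project into the squeezed space
    inner = mhsa_macs(num_tokens, squeezed)      # attention on squeezed features
    expand = num_tokens * squeezed * channels    # project back to full width
    return squeeze + inner + expand


if __name__ == "__main__":
    # Hypothetical final-stage sizes: a 7x7 token grid with 384 channels.
    tokens, width, ratio = 49, 384, 4
    full = mhsa_macs(tokens, width)
    reduced = squeezed_mhsa_macs(tokens, width, ratio)
    print(f"standard MHSA: {full:,} MACs")
    print(f"squeezed MHSA: {reduced:,} MACs ({reduced / full:.1%} of standard)")
```

Under these assumptions, the projection terms dominate at small token counts, which is why shrinking the channel width before attention can reduce cost substantially in a late, low-resolution stage.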
Wu et al. (Mon,) studied this question.