The Vision Transformer (ViT), with its attention mechanism, achieves strong performance on visual tasks, but its high computational complexity and memory requirements (for example, ViT-Base at 224 × 224 input needs about 17.6 GFLOPs and more than 2 GB of FP32 inference memory) limit its deployment on resource-constrained edge devices. In this paper, we propose a collaborative optimization framework that combines algorithm compression, hardware-aware acceleration, and compiler optimization, with a special focus on two potential breakthrough technologies of 2025: the MambaVision hybrid architecture and PH-Reg dynamic robustness enhancement. With these optimization methods, the framework reduces PackQViT latency to 12.3 ms, achieves a throughput of 62 img/s for DynamicViT, and maintains or improves on the 84.6% accuracy of ViT-Base (e.g., PackQViT reaches 85.2%). We also explore open challenges such as the generalization of ultra-low-precision quantization, the stability of dynamic architectures, cross-device collaboration, and the balance between privacy and energy efficiency.
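As a rough sanity check on the 17.6 GFLOPs figure quoted above, the following sketch counts the multiply-accumulate operations (MACs) of a standard ViT-Base forward pass. It rests on several assumptions not stated in the text: the usual ViT-Base configuration (patch size 16, 12 layers, hidden size 768, MLP ratio 4, 1000 classes, one class token), the common convention of reporting one MAC as one "FLOP", and the omission of softmax, LayerNorm, GELU, and bias terms. The function name `vit_macs` is hypothetical.

```python
def vit_macs(image_size=224, patch_size=16, channels=3,
             dim=768, depth=12, mlp_ratio=4, num_classes=1000):
    """Approximate MAC count for one ViT forward pass (matmuls only)."""
    num_patches = (image_size // patch_size) ** 2      # 14 * 14 = 196
    tokens = num_patches + 1                           # plus the class token

    # Patch embedding: each patch (P*P*C values) is projected to `dim`.
    patch_embed = num_patches * (patch_size * patch_size * channels) * dim

    # Per-layer attention: Q, K, V and output projections (4 dim x dim matmuls).
    attn_proj = 4 * tokens * dim * dim
    # Per-layer attention: QK^T scores and the attention-weighted sum of V.
    attn_matmul = 2 * tokens * tokens * dim
    # Per-layer MLP: dim -> mlp_ratio*dim -> dim.
    mlp = 2 * tokens * dim * (mlp_ratio * dim)

    # Classification head applied to the class token only.
    head = dim * num_classes

    return patch_embed + depth * (attn_proj + attn_matmul + mlp) + head


if __name__ == "__main__":
    macs = vit_macs()
    print(f"ViT-Base @ 224x224: {macs / 1e9:.2f} GMACs")
```

Under these assumptions the count comes to roughly 17.56 billion MACs, which is consistent with the quoted figure once rounded. Note that the quadratic attention term (`attn_matmul`) is a small share of the total at this resolution; the linear projections and MLP dominate.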
Yifan Wu (Fri,) studied this question.