Vision Transformer (ViT), with its attention mechanism, performs strongly on visual tasks, but its high computational complexity and memory requirements (for example, ViT-Base at 224 × 224 input requires 17.6 GFLOPs and more than 2 GB of FP32 inference memory) limit its deployment on resource-constrained edge devices. In this paper, we propose a collaborative optimization framework that combines algorithm compression, hardware-aware acceleration, and compiler optimization, with a special focus on two possible breakthrough technologies of 2025: the MambaVision hybrid architecture and PH-Reg dynamic robustness enhancement. Through reliable optimization methods, the framework reduces PackQViT latency to 12.3 ms, achieves a DynamicViT throughput of 62 img/s, and maintains or improves on the ViT-Base accuracy of 84.6% (e.g., PackQViT reaches 85.2%). In addition, we explore open challenges such as the generalization of ultra-low-precision quantization, dynamic architecture stability, cross-device collaboration, and the balance between privacy and energy efficiency.
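The compute figure quoted for ViT-Base can be roughly sanity-checked with a back-of-the-envelope count. The following is a minimal sketch, not the paper's own measurement. It assumes the standard ViT-Base/16 configuration (patch size 16, hidden width 768, 12 layers, MLP ratio 4, 1000 output classes), none of which is stated in the abstract, and it counts only the multiply-accumulate operations of the linear layers and attention matrix products, ignoring softmax, LayerNorm, biases, and activations. The function name `vit_macs` is hypothetical.

```python
def vit_macs(image=224, patch=16, dim=768, depth=12,
             mlp_ratio=4, classes=1000, channels=3):
    """Approximate multiply-accumulate count for one forward pass of a ViT.

    All default values are assumptions (standard ViT-Base/16), not values
    taken from the paper.
    """
    # Patch tokens plus one class token.
    tokens = (image // patch) ** 2 + 1

    # Patch embedding: each patch is a linear map from patch*patch*channels to dim.
    patch_embed = (tokens - 1) * (patch * patch * channels) * dim

    # Per-layer costs.
    qkv = tokens * dim * 3 * dim             # query, key, value projections
    attn = 2 * tokens * tokens * dim         # QK^T scores and attention-weighted V
    proj = tokens * dim * dim                # attention output projection
    mlp = 2 * tokens * dim * mlp_ratio * dim # two MLP linear layers

    # Classification head applied to the class token only.
    head = dim * classes

    return patch_embed + depth * (qkv + attn + proj + mlp) + head


total = vit_macs()
print(f"{total / 1e9:.2f} billion multiply-accumulates")
```

Under these assumptions the count comes to about 17.6 billion multiply-accumulates, which is consistent with the quoted 17.6 GFLOPs if that figure counts one multiply-accumulate as one operation. The abstract does not say which convention it uses. Note that the attention term grows quadratically with the token count, so the total rises faster than linearly as input resolution increases.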
Yifan Wu
Scuola Superiore Sant'Anna
Journal of Computing and Electronic Information Management
Yifan Wu (Fri,) studied this question.
synapsesocial.com/papers/68bb46bd6d6d5674bccfebdf — DOI: https://doi.org/10.54097/b7d7w798