Vision Transformer (ViT) with its attention mechanism in based on visual task performance, but its high computational complexity and memory requirements (such as ViT-base under the 224 x 224 input should be 17.6 GFLOPs, more than 2 GB of FP32 inference memory) limits its deployment on resource-constrained edge devices. In this paper, we propose a collaborative optimization framework that combines algorithm compression, hardware-aware acceleration, and compiler optimization, with a special focus on the possible breakthrough technologies in 2025 - MambaVision hybrid architecture and PH-Reg dynamic robustness enhancement. Through reliable optimization methods, the framework reduces PackQViT latency to 12.3 ms, achieves 62 img/s throughput of DynamicViT, and maintains or improves the accuracy over ViT-Base accuracy of 84.6% (e.g., PackQViT reaches 85.2%). In addition, challenges such as ultra-low-precision quantization generalization, dynamic architecture stability, cross-device collaboration, and the balance between privacy and energy efficiency are also explored.
Building similarity graph...
Analyzing shared references across papers
Loading...
Yifan Wu (Fri,) studied this question.
synapsesocial.com/papers/68bb46bd6d6d5674bccfebdf — DOI: https://doi.org/10.54097/b7d7w798
Yifan Wu
Scuola Superiore Sant'Anna
Journal of Computing and Electronic Information Management
Building similarity graph...
Analyzing shared references across papers
Loading...
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: