Deploying Vision Transformers (ViTs) on edge devices is difficult because their high computational demands and memory-access overheads hinder real-time inference. This paper proposes a modular, adaptive ViT acceleration architecture for the AMD Versal ACAP platform that addresses these bottlenecks through heterogeneous resource collaboration and fine-grained dataflow optimization. We introduce a resource-efficient attention module that keeps self-attention operations within AI Engine (AIE) core clusters, reducing inter-module communication and MAC resource usage. In parallel, a resource-aware multi-stage pipeline scheduling strategy dynamically partitions and parallelizes the computation-intensive feed-forward network (FFN), improving computation reuse and module-level coordination. The architecture combines parameter tiling with a PLIO-based broadcasting mechanism to build a decoupled compute-communication dataflow engine that alleviates memory bottlenecks. Experimental results on the Xilinx VCK5000 ACAP platform show that the design achieves 33.2 TOPS of throughput at INT8 precision, outperforming the state-of-the-art EQ-ViT accelerator by 27%, while maintaining a competitive efficiency of 510.6 GOPS/W. Scalability evaluations on ViT-Base and DeiT-Tiny confirm the design's adaptability in edge scenarios, offering a resource-efficient and reconfigurable hardware paradigm for high-density Transformer inference.
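To put the headline figures in context, the short sketch below derives two quantities the abstract does not state directly: the board power implied by the reported throughput and efficiency, and the baseline throughput implied by the 27% margin. It rests on two assumptions that the abstract does not confirm: that the 510.6 GOPS/W figure was measured at the 33.2 TOPS operating point, and that the 27% improvement refers to throughput. The function names are illustrative only.

```python
def implied_power_watts(throughput_tops: float, efficiency_gops_per_w: float) -> float:
    """Power implied by throughput divided by efficiency (1 TOPS = 1000 GOPS).

    Assumption: efficiency was measured at the same operating point as throughput.
    """
    return throughput_tops * 1000.0 / efficiency_gops_per_w


def implied_baseline_tops(throughput_tops: float, improvement_fraction: float) -> float:
    """Baseline throughput implied by a relative improvement.

    Assumption: the improvement is a throughput ratio, i.e. new = old * (1 + fraction).
    """
    return throughput_tops / (1.0 + improvement_fraction)


# Figures taken from the abstract.
power_w = implied_power_watts(33.2, 510.6)
baseline_tops = implied_baseline_tops(33.2, 0.27)

print(f"Implied power: {power_w:.1f} W")
print(f"Implied baseline throughput: {baseline_tops:.1f} TOPS")
```

Under these assumptions the design would draw roughly 65 W, and the implied baseline throughput would be about 26 TOPS. Both values are back-of-the-envelope estimates, not measurements reported by the authors.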
Zhang et al. (Thu,) studied this question.