Transformer models have achieved remarkable success in computer vision tasks, yet deploying them on resource-constrained edge devices remains difficult because of their high computational complexity, memory demands, and hardware inefficiencies. This paper presents a holistic optimization framework that addresses these issues for real-time image processing in edge environments, particularly in autonomous driving systems. We propose a dynamic structured pruning method that adjusts model sparsity according to real-time scene complexity, and combine it with post-training quantization to reduce model size while preserving accuracy. We also co-design the algorithm with FPGA and SoC hardware platforms, using custom sparse kernels, memory hierarchy optimization, and energy-efficient execution techniques. Evaluated on the KITTI and Cityscapes datasets, our method reduces inference latency by 55% with less than a 2% loss in accuracy and improves energy efficiency by up to 3.1×. Real-world tests confirm that the system remains robust under diverse operating conditions. This work offers a scalable and adaptable solution for deploying high-performance Transformer models in edge AI applications.
Xiaohua Tong (Wed,) studied this question.
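To make the two compression ideas named in the abstract more concrete, the following is a minimal sketch in plain Python, not the paper's implementation. Everything in it is an assumption made for illustration: the gradient-based `scene_complexity` proxy, the linear mapping in `target_sparsity` (including the assumed direction that simpler scenes tolerate more pruning and the bounds `s_min` and `s_max`), the choice of whole attention heads as the pruned structure, and symmetric per-tensor int8 as the quantization scheme.

```python
def scene_complexity(gray):
    """Hypothetical complexity proxy: mean absolute neighbor difference in [0, 1].

    `gray` is a 2D list of pixel intensities in the range 0..255.
    """
    h, w = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal neighbor
                total += abs(gray[y][x + 1] - gray[y][x])
                count += 1
            if y + 1 < h:  # vertical neighbor
                total += abs(gray[y + 1][x] - gray[y][x])
                count += 1
    return (total / count) / 255.0 if count else 0.0


def target_sparsity(complexity, s_min=0.2, s_max=0.7):
    """Assumed linear policy: simple scenes get s_max, complex scenes get s_min."""
    c = min(max(complexity, 0.0), 1.0)
    return s_max - c * (s_max - s_min)


def heads_to_keep(head_scores, sparsity):
    """Structured pruning: keep only the highest-scoring attention heads.

    Whole heads are dropped, so the remaining computation stays dense.
    Returns the sorted indices of the heads that survive.
    """
    n_keep = max(1, round(len(head_scores) * (1.0 - sparsity)))
    ranked = sorted(range(len(head_scores)),
                    key=lambda i: head_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def quantize_int8(weights):
    """Symmetric per-tensor post-training quantization to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    flat = [[128] * 4 for _ in range(4)]  # uniform image, no edges
    checker = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]
    scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4]  # made-up head importances
    for name, img in (("flat", flat), ("checker", checker)):
        s = target_sparsity(scene_complexity(img))
        print(name, round(s, 2), heads_to_keep(scores, s))
    q, scale = quantize_int8([0.5, -1.0, 0.25, 0.0])
    print(q, dequantize(q, scale))
```

In this sketch the sparsity decision is made per frame and only selects which heads run, so the weights themselves are never modified at inference time. A real deployment would need an importance score calibrated on data and a complexity signal that is cheap enough to compute ahead of the model.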