Hybrid Vision Transformers (HybridViTs), which integrate convolutional neural networks (CNNs) with Transformer blocks, combine local and global feature extraction and achieve high performance across a range of computer vision tasks. However, the substantial computational asymmetry between lightweight CNN blocks and compute-intensive Transformer blocks makes it difficult to optimize and accelerate both within a single hardware architecture. To address this challenge, we propose FLASH, a power-efficient field-programmable gate array (FPGA)-based accelerator tailored for CNN-Transformer hybrid networks. FLASH reduces quantization overhead by consolidating redundant quantization-dequantization operations into a single requantization step, and it enables 8-bit integer-only computation for residual connections by aligning the scaling factors of the two branches. To further improve hardware efficiency, FLASH introduces hardware-friendly linear approximations of nonlinear functions such as Swish and Softmax. For Softmax, precomputing row-wise maximum values through offline calibration eliminates both the max-value search logic and the intermediate memory buffering overhead, while shared integer-exponential units are reused to minimize resource consumption. Architecturally, FLASH employs a two-stage pipeline: Stage 1 eliminates external DRAM access using a fully pipelined MobileNetV2 backbone, while Stage 2 accelerates the Transformer and convolutional components through specialized compute units and dataflow optimizations. Experimental evaluation using MobileViT-xxs (MViT-xxs) on a Xilinx VCU118 FPGA demonstrates that FLASH incurs only a 0.84% accuracy drop on ImageNet-1K compared to the FP32 baseline, while achieving up to 16.8× lower power consumption and a 26.3× improvement in energy efficiency relative to CPU/GPU implementations. These results establish FLASH as an energy-efficient hardware accelerator for real-time inference of HybridViT models on edge devices.
Kim et al. studied this question.
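The requantization and integer-only residual ideas summarized above can be illustrated with a minimal Python sketch. It is not FLASH's actual datapath: the 16-bit fixed-point shift, the function names (`to_fixed`, `requantize`, `residual_add_int8`), and the example scales are all assumptions chosen for illustration. The sketch shows how a dequantize-then-quantize pair with scales `s_in` and `s_out` can be folded into one integer multiply and shift, and how two INT8 branches with different scales can be summed without leaving the integer domain.

```python
SHIFT = 16  # assumed fixed-point precision for the scale ratio (illustrative)


def to_fixed(multiplier: float) -> int:
    """Encode a real-valued scale ratio as an integer with SHIFT fractional bits."""
    return round(multiplier * (1 << SHIFT))


def clamp_int8(value: int) -> int:
    """Saturate to the signed 8-bit range."""
    return max(-128, min(127, value))


def dequant_then_quant(q: int, s_in: float, s_out: float) -> int:
    """Reference float path: dequantize with s_in, then quantize with s_out."""
    return clamp_int8(round(q * s_in / s_out))


def requantize(q: int, m: int) -> int:
    """Single-step requantization: integer multiply, round-half-up, shift."""
    return clamp_int8((q * m + (1 << (SHIFT - 1))) >> SHIFT)


def residual_add_int8(qa: int, qb: int, m_a: int, m_b: int) -> int:
    """Add two INT8 branches whose scales differ.

    m_a and m_b are fixed-point encodings of s_a / s_out and s_b / s_out,
    so both operands are rescaled to the output scale inside one accumulator.
    """
    acc = qa * m_a + qb * m_b
    return clamp_int8((acc + (1 << (SHIFT - 1))) >> SHIFT)


if __name__ == "__main__":
    # Hypothetical scales, not taken from the paper.
    m = to_fixed(0.05 / 0.1)
    print(requantize(40, m), dequant_then_quant(40, 0.05, 0.1))

    # Both branches represent 1.0; the sum 2.0 is expressed at scale 0.05.
    m_a, m_b = to_fixed(0.02 / 0.05), to_fixed(0.04 / 0.05)
    print(residual_add_int8(50, 25, m_a, m_b))
```

The integer path rounds ties upward while Python's `round` rounds ties to even, so the two paths can differ by one quantization level on exact ties. A hardware design would fix a single rounding convention at calibration time.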