What question did this study set out to answer?

The goal is to create an energy-efficient FPGA accelerator for CNN-Transformer hybrid networks, optimizing performance and power consumption.

April 4, 2026

FLASH: Energy-Efficient FPGA Acceleration via Linear Approximation and Streamlined Two-Stage Pipeline Architectures for Quantized CNN–Transformer Hybrid Networks

Key Points

Developed FLASH, an FPGA-based hardware accelerator.
Consolidated quantization-dequantization into a single requantization step.
Implemented hardware-friendly linear approximations for nonlinear functions.
Adopted a two-stage pipeline architecture in which Stage 1 eliminates external DRAM access.
Incurred only a 0.84% accuracy drop on ImageNet-1K compared to the FP32 baseline.
Achieved up to 16.8× lower power consumption than CPU/GPU implementations.
Achieved a 26.3× improvement in energy efficiency relative to CPU/GPU implementations on MobileViT-xxs.
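
The requantization idea in the key points can be illustrated with a minimal sketch. It is not the FLASH implementation: the function names, the 16-bit fixed-point multiplier, and the scale values are assumptions chosen for illustration. The sketch folds the dequantize-then-quantize pair (multiply by the input and weight scales, then divide by the output scale) into one integer multiply and shift, and compares the result with the floating-point reference path.

```python
import numpy as np

def quantize_multiplier(real_multiplier, bits=16):
    """Approximate a positive real multiplier as m / 2**bits (hypothetical helper)."""
    m = int(round(real_multiplier * (1 << bits)))
    return m, bits

def requantize(acc_int32, m, shift):
    """Integer-only requantization: multiply, round, shift right, clip to int8."""
    acc = acc_int32.astype(np.int64) * m
    acc = (acc + (1 << (shift - 1))) >> shift  # add half an LSB, then arithmetic shift
    return np.clip(acc, -128, 127).astype(np.int8)

def dequant_then_quant(acc_int32, s_in, s_w, s_out):
    """Reference path: dequantize to float, then quantize to the output scale."""
    real = acc_int32.astype(np.float64) * (s_in * s_w)
    return np.clip(np.round(real / s_out), -128, 127).astype(np.int8)

# Assumed example scales (not taken from the paper).
s_in, s_w, s_out = 0.02, 0.005, 0.05
m, shift = quantize_multiplier(s_in * s_w / s_out)

acc = np.array([-60000, -1234, 0, 777, 25000, 90000], dtype=np.int32)
fused = requantize(acc, m, shift)
reference = dequant_then_quant(acc, s_in, s_w, s_out)
max_diff = int(np.max(np.abs(fused.astype(np.int32) - reference.astype(np.int32))))
print(fused, reference, max_diff)
```

In this sketch the fused path needs no floating-point arithmetic at inference time, because the combined scale is converted to an integer multiplier offline. The two paths can differ by at most one quantization level here, due to the rounding of the multiplier.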

Abstract

Hybrid Vision Transformers (HybridViTs), which integrate convolutional neural networks (CNNs) with Transformer blocks, offer both local and global feature extraction capabilities, achieving high performance across a range of computer vision tasks. However, the substantial computational asymmetry between lightweight CNN blocks and compute-intensive Transformer blocks presents significant challenges for simultaneous optimization and acceleration within a single hardware architecture. To address these challenges, we propose FLASH, a power-efficient field-programmable gate array (FPGA)-based accelerator tailored for CNN-Transformer hybrid networks. FLASH reduces quantization overhead by consolidating redundant quantization-dequantization operations into a single requantization step and enables 8-bit integer-only computation for residual connections through proper scaling factor handling. To further optimize for hardware efficiency, FLASH introduces hardware-friendly linear approximations of nonlinear functions such as Swish and Softmax. By precomputing row-wise max values through offline calibration, we eliminate both max-value search logic and intermediate memory buffering overhead, while reusing shared integer-exponential units to minimize resource consumption. Architecturally, FLASH employs a two-stage pipeline: Stage 1 eliminates external DRAM access using a fully pipelined MobileNetV2 backbone, while Stage 2 accelerates Transformer and convolutional components through specialized compute units and dataflow optimizations. Experimental evaluation using MobileViT (MViT)-xxs on Xilinx VCU118 FPGA demonstrates that FLASH incurs only a 0.84% accuracy drop on ImageNet-1K compared to the FP32 baseline, while achieving up to 16.8× lower power consumption and 26.3× improvement in energy efficiency relative to CPU/GPU implementations. These results establish FLASH as an energy-efficient hardware accelerator for real-time inference of HybridViT models on edge devices.
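
The Softmax simplification described in the abstract can be sketched as follows. This is an illustrative approximation under stated assumptions, not the FLASH design: the abstract does not specify the exact linear approximation, so the sketch swaps in a common one, writing exp(x) as 2^q · 2^f and replacing 2^f on [0, 1) with the line 1 + f. The names `exp_linear` and `softmax_calibrated`, the clamp to zero, and the example values are hypothetical. A constant obtained offline stands in for the per-row maximum, so no max search or row buffering is needed before exponentiation.

```python
import numpy as np

def exp_linear(x):
    """Approximate exp(x) for x <= 0 as 2**q * (1 + f), a linear stand-in for 2**f."""
    t = x * np.log2(np.e)   # exp(x) == 2**t
    q = np.floor(t)         # integer part: a bit shift in hardware
    f = t - q               # fractional part in [0, 1)
    return (1.0 + f) * np.exp2(q)

def softmax_calibrated(scores, calibrated_max):
    """Softmax that subtracts a precomputed constant instead of each row's max."""
    # Assumption: clamp to 0 in case a score exceeds the calibrated max.
    z = np.minimum(scores - calibrated_max, 0.0)
    e = exp_linear(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_exact(scores):
    """Floating-point reference with the usual per-row max subtraction."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]])
approx = softmax_calibrated(scores, calibrated_max=5.0)  # 5.0 is an assumed calibration value
exact = softmax_exact(scores)
print(np.max(np.abs(approx - exact)))
```

Because Softmax is invariant to subtracting a constant from every score in a row, a calibrated constant gives the same exact result as the true row maximum as long as no score exceeds it. The remaining error in this sketch comes only from the linear stand-in for 2^f, which overestimates by at most about 6%.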
