What type of study is this?

This is an experimental study.

October 3, 2025 | Open Access

Efficient CNN Accelerator Based on Low-End FPGA with Optimized Depthwise Separable Convolutions and Squeeze-and-Excite Modules

Key points

The proposed CNN accelerator improves processing efficiency on low-end FPGAs by optimizing depthwise separable convolutions and squeeze-and-excite modules.
The optimizations yield at least a 1.47× performance improvement over ARM CPUs and save over 90% of Digital Signal Processors (DSPs) compared to other FPGA solutions.
The accelerator features configurable parameters, allowing for adaptable hardware resource consumption and computational speed to match different applications.
Reduced data latency is achieved by minimizing reliance on internal caches, leading to better overall processing efficiency.

Abstract

With the rapid development of artificial intelligence in intelligent manufacturing, convolutional neural networks (CNNs) have shown excellent performance and generalization in industrial applications. However, their large computational and resource requirements are a major obstacle to deployment on low-end hardware platforms. To address this issue, this paper proposes a scalable CNN accelerator that runs on low-performance Field-Programmable Gate Arrays (FPGAs), targeting the efficient execution of complex neural network models on resource-constrained hardware. The study specifically optimizes depthwise separable convolution and the squeeze-and-excite module to improve their computational efficiency. Configurable parameters allow hardware resource consumption and computational speed to be adjusted flexibly, so the accelerator adapts to FPGAs of varying performance and to different application requirements. By fully exploiting the characteristics of depthwise separable convolution, the accelerator optimizes the convolution computation process and allows modules to be stacked flexibly and independently at different stages of computation, which balances hardware resource consumption against computation time. Compared to ARM CPUs, the proposed approach yields at least a 1.47× performance improvement, and compared to other FPGA solutions, it saves over 90% of Digital Signal Processors (DSPs). Additionally, the optimized computational flow significantly reduces the accelerator’s reliance on internal caches, minimizing data latency and further improving overall processing efficiency.
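
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal software sketch of a depthwise separable convolution and a squeeze-and-excite module, together with a multiply-accumulate (MAC) count that illustrates why the former suits resource-constrained hardware. It is not the paper's hardware design. The function names, the nested-list tensor layout `[channel][row][column]`, stride 1 with no padding, the omission of biases, and the ReLU/sigmoid activations in the squeeze-and-excite gate are all assumptions made for illustration.

```python
import math

def depthwise_conv(x, dw_kernels):
    """Depthwise step: each input channel is filtered by its own KxK kernel.
    x: [C][H][W], dw_kernels: [C][K][K]. Assumes stride 1, no padding."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(dw_kernels[0])
    out_h, out_w = H - K + 1, W - K + 1
    return [[[sum(x[c][i + di][j + dj] * dw_kernels[c][di][dj]
                  for di in range(K) for dj in range(K))
              for j in range(out_w)]
             for i in range(out_h)]
            for c in range(C)]

def pointwise_conv(x, pw_weights):
    """Pointwise step: a 1x1 convolution that mixes channels.
    x: [C][H][W], pw_weights: [M][C] for M output channels."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(pw_weights[m][c] * x[c][i][j] for c in range(C))
              for j in range(W)]
             for i in range(H)]
            for m in range(len(pw_weights))]

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excite: rescale each channel by a learned gate.
    w1: [R][C] (reduction layer), w2: [C][R] (expansion layer).
    Assumes ReLU then sigmoid; a given network may use other activations."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [sum(sum(row) for row in x[c]) / (H * W) for c in range(C)]
    # Excite: two small fully connected layers produce a gate per channel.
    hidden = [max(0.0, sum(w1[r][c] * squeezed[c] for c in range(C)))
              for r in range(len(w1))]
    gates = [1.0 / (1.0 + math.exp(-sum(w2[c][r] * hidden[r]
                                        for r in range(len(hidden)))))
             for c in range(C)]
    return [[[x[c][i][j] * gates[c] for j in range(W)]
             for i in range(H)]
            for c in range(C)]

def mac_counts(C, M, K, out_h, out_w):
    """MACs for a standard KxK convolution versus its depthwise separable
    counterpart, for C input channels and M output channels."""
    standard = out_h * out_w * C * M * K * K
    separable = out_h * out_w * C * K * K + out_h * out_w * C * M
    return standard, separable
```

As a usage example, `mac_counts(32, 64, 3, 56, 56)` returns `(57802752, 7325696)`, so the separable form needs about 7.9 times fewer MACs for that illustrative layer; the ratio of separable to standard work is `1/M + 1/K**2`. The layer dimensions here are chosen for illustration and are not taken from the paper.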
