In this study, we propose a general method for accelerating neural network inference on GPUs for embedded systems. TensorRT is now widely used for neural network inference on embedded GPUs. However, 8-bit quantization, an efficient optimization method, is not supported by TensorRT on the NVIDIA Jetson Nano GPU. To address this, we propose an acceleration method that quantizes weights and activations without TensorRT. Comparative experiments with TensorRT-optimized frameworks demonstrate that our method effectively accelerates inference while maintaining inference accuracy.
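To illustrate what quantizing weights and activations to 8 bits can look like, the following is a minimal sketch in plain Python. It assumes a symmetric, per-tensor scheme with a single scale factor; the abstract does not specify the scheme the authors use, so this is not their method. The function names `quantize_int8` and `quantized_dot` are hypothetical and chosen only for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (an assumed scheme, not the paper's).

    Maps floats to integers in [-127, 127] and returns them with the scale.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def quantized_dot(weights, activations):
    """Dot product computed on 8-bit integers, then rescaled to float."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    # Accumulate in integer arithmetic, as an integer kernel would.
    acc = sum(w * a for w, a in zip(q_w, q_a))
    # One floating-point multiply restores the original units.
    return acc * scale_w * scale_a


weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer result stays close to the floating-point one because rounding error is bounded by half a quantization step per element. A real deployment would run the integer accumulation in a GPU kernel and typically calibrate the activation scale on representative data.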
Terakura et al. (Sun,) studied this question.