Deep learning (DL) has evolved dramatically and become one of the most successful machine learning techniques. DL-enabled applications are now widely integrated into software systems, including embedded ones. Although deep neural networks achieve high accuracy, their large size can demand significant runtime and computing resources. To overcome these drawbacks, TensorRT has been developed, and it can be incorporated into popular DL frameworks such as PyTorch and Open Neural Network Exchange (ONNX). In this paper, focusing on inference, we provide a comprehensive evaluation of TensorRT's performance. Specifically, we validate inference outputs and measure inference time, inference throughput, and GPU memory usage. Our results demonstrate that TensorRT significantly improves these inference efficiency metrics without compromising inference accuracy. Furthermore, TensorRT can be adopted via several alternative workflows. Our evaluation also reveals the pros and cons of these workflows; we analyze each workflow and discuss which one to select for different application scenarios.
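As an illustration of how two of the metrics named above, inference time and inference throughput, are commonly measured, the following is a minimal sketch and not the authors' benchmarking code. The function `benchmark`, its parameters, and the stand-in workload `dummy_infer` are hypothetical names introduced here; the abstract does not describe the measurement procedure, so the warm-up and repetition counts are assumptions.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(infer: Callable[[], object], batch_size: int,
              warmup: int = 5, runs: int = 20) -> Dict[str, float]:
    """Time a zero-argument inference callable and derive throughput.

    Hypothetical helper: `infer` is assumed to run one batch of
    `batch_size` samples per call.
    """
    if batch_size <= 0 or runs <= 0:
        raise ValueError("batch_size and runs must be positive")

    # Warm-up calls are excluded from timing so that one-time costs
    # (lazy initialization, cache population) do not skew the results.
    for _ in range(warmup):
        infer()

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies.append(time.perf_counter() - start)

    mean_s = statistics.mean(latencies)
    return {
        "mean_latency_ms": mean_s * 1e3,
        "median_latency_ms": statistics.median(latencies) * 1e3,
        # Samples processed per second, based on the mean batch latency.
        "throughput_per_s": batch_size / mean_s,
    }


def dummy_infer() -> int:
    # Stand-in CPU workload; a real run would call the model here.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(benchmark(dummy_infer, batch_size=8))
```

When the callable launches GPU work asynchronously, the device should be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the timer captures only the kernel launch. Peak GPU memory can be read separately, for instance with `torch.cuda.max_memory_allocated()`.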
Zhou et al. (Thu,) studied this question.