Los puntos clave no están disponibles para este artículo en este momento.
Recent years have seen an increasing trend in designing AI accelerators together with the rest of the system, including CPUs and memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such kind of co-exploration. Currently, the majority of such exploration is supported by AI accelerator analytical models. But such models usually overlook the non-trivial impact of congestion of shared resources, non-ideal hardware utilization and non-zero CPU scheduler overhead, which could only be modeled by cycle-level simulators. However, most simulators with full-stack toolchains are proprietary to corporations, and the few open-source simulators are suffering from either weak compilers or limited space of modeling. This framework resolves these issues by proposing a compilation and simulation flow to run arbitrary Caffe neural network models on the NVIDIA Deep Learning Accelerator (NVDLA) with gem5, a cycle-level simulator, and by adding more building blocks including scratchpad allocation, multi-accelerator scheduling, tensor-level prefetching mechanisms, and a DMA-aided embedded buffer to map workload to multiple NVDLAs. The proposed framework has been tested and verified on a set of convolution neural networks, showcasing the capability of modeling complex buffer management strategies, scheduling policies, and hardware architectures. As a case study of this framework, we demonstrate the importance of adopting different buffering strategies for activation and weight tensors in AI accelerators to acquire remarkable speedup.
Lai et al. (Mon,) studied this question.