Although the core operations in various AI models can be formulated as matrix multiplication (MM), their characteristics are quite different (Fig. 1). The Q-K-V generation in transformers [1], the combination phase in graph convolutional networks (GCNs) [2], and the convolution layers in CNNs [4] involve static MM with constant weights, which compute-in-memory (CIM) can exploit to eliminate costly data movement. However, the dominant MMs of attention in transformers, aggregation in GCNs, and graph construction in vision GNNs (ViG) [6] are dynamic, in that neither input is constant, which degrades the benefits of CIM. In addition, the varying sparsity of MM across operators typically demands different zero-skipping granularities, leading to different hardware overheads. Therefore, a domain-specific AI accelerator faces three main challenges: 1) customized design schemes for numerous and ever-changing AI operators or models lead to excessive and divergent hardware modules, limiting flexibility and overall utilization [4]; 2) a unified computing array based on CIM cannot efficiently process MM with varying sparsity, scale, and data formats; 3) massive data movement between adjacent operators causes frequent and intensive off-chip memory accesses, resulting in high latency and energy consumption.
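The distinction between static and dynamic MM, and the effect of zero skipping, can be illustrated with a small software sketch. This is not the accelerator's design: the function names (`attention_scores`, `matmul_skip_zeros`), the tiny integer matrices, and the choice of element-wise skipping as the granularity are all assumptions made purely for illustration.

```python
def matmul(a, b):
    """Dense matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Hypothetical fixed projection weights: they never change between inputs,
# so a CIM array could keep them resident in memory.
W_Q = [[1, 0], [0, 1], [1, 1]]
W_K = [[1, 1], [0, 1], [1, 0]]


def attention_scores(x):
    """Compute Q K^T for an input x (unnormalized, no softmax)."""
    q = matmul(x, W_Q)  # static MM: one operand is a constant weight
    k = matmul(x, W_K)  # static MM: one operand is a constant weight
    # Dynamic MM: both q and k depend on the input, so neither operand
    # can be preloaded as a constant.
    return matmul(q, transpose(k))


def matmul_skip_zeros(a, b):
    """Multiply a by b, skipping every zero element of a.

    Returns the product and the number of multiply-accumulates performed.
    Element-wise skipping is only one possible granularity (assumed here).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for k in range(inner):
            if a[i][k] == 0:
                continue  # zero operand contributes nothing to the result
            for j in range(cols):
                out[i][j] += a[i][k] * b[k][j]
                macs += 1
    return out, macs


if __name__ == "__main__":
    print(attention_scores([[1, 2, 0], [0, 1, 3]]))

    # A sparse adjacency matrix times a feature matrix, as in GCN aggregation.
    adjacency = [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
    features = [[1, 2], [3, 4], [5, 6]]
    product, macs = matmul_skip_zeros(adjacency, features)
    print(product, macs, "MACs instead of", 3 * 3 * 2)
```

In this sketch the work saved by skipping scales with the number of zeros in the left operand. A coarser granularity (skipping whole rows or blocks) would need less bookkeeping per element but would miss isolated zeros, which is the kind of overhead trade-off the paragraph above refers to.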
Qiu et al. (Sun,) studied this question.