May 7, 2020 · Open Access

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

Abstract

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO requires neither fine-tuning nor retraining to compensate for the quantization error. We present two practical hardware applications of GOBO. In the first, GOBO reduces memory storage and traffic and, as a result, inference latency and energy consumption. This GOBO memory compression mechanism is plug-in compatible with many architectures; we demonstrate it with the TPU, Eyeriss, and an architecture using Tensor Cores-like units. In the second, we present a co-designed hardware architecture that also reduces computation. Uniquely, the GOBO architecture maintains most of the weights in 3 bits even during computation, a property that: (1) makes the processing elements area efficient, allowing us to pack more compute power per unit area, (2) replaces most multiply-accumulations with additions, and (3) reduces the off-chip traffic by amplifying on-chip memory capacity.
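The abstract does not describe how the 3-bit weights are chosen, so the following is only an illustrative Python sketch of the general idea, not GOBO's actual procedure. It assumes outliers are picked by a simple z-score threshold and that the remaining weights are mapped to 8 centroids taken from equal-population bins; both choices, and the names `quantize_3bit`, `dot_quantized`, and `average_bits`, are hypothetical. It shows how a dot product over dictionary-coded weights can use additions per weight and only one multiplication per centroid, and how the average storage cost per weight might be estimated.

```python
import statistics


def quantize_3bit(weights, outlier_z=4.0, bits=3):
    """Split weights into full-precision outliers and a bulk coded with 2**bits centroids.

    Assumption: z-score outlier test and equal-population bins (not from the paper).
    Requires at least 2**bits non-outlier weights.
    """
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)
    # Outliers keep their original value, keyed by position.
    outliers = {i: w for i, w in enumerate(weights)
                if abs(w - mean) > outlier_z * std}
    bulk = sorted(w for i, w in enumerate(weights) if i not in outliers)
    n_bins = 2 ** bits
    centroids = []
    for b in range(n_bins):
        lo = b * len(bulk) // n_bins
        hi = (b + 1) * len(bulk) // n_bins
        centroids.append(statistics.fmean(bulk[lo:hi]))
    # Each weight is replaced by the index of its nearest centroid.
    indices = [min(range(n_bins), key=lambda k: abs(w - centroids[k]))
               for w in weights]
    return centroids, indices, outliers


def dot_quantized(activations, centroids, indices, outliers):
    """Dot product that adds activations per centroid, then multiplies once per centroid."""
    sums = [0.0] * len(centroids)
    outlier_acc = 0.0
    for i, a in enumerate(activations):
        if i in outliers:
            outlier_acc += a * outliers[i]   # rare full-precision multiply
        else:
            sums[indices[i]] += a            # addition only
    return outlier_acc + sum(c * s for c, s in zip(centroids, sums))


def average_bits(outlier_fraction=0.001, low_bits=3, full_bits=32):
    """Average bits per weight, ignoring centroid tables and outlier positions."""
    return (1 - outlier_fraction) * low_bits + outlier_fraction * full_bits
```

With the abstract's figures (99.9% of weights at 3 bits, the rest at 32 bits), `average_bits()` gives about 3.03 bits per weight, which would correspond to roughly a 10x reduction over 32-bit storage before accounting for the centroid table and outlier bookkeeping.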

Cite This Study

Zadeh et al. (2020) studied this question.

synapsesocial.com/papers/69df4f5e6324afb55d5926b1 https://doi.org/10.1109/micro50266.2020.00071
