Artificial intelligence (AI) has emerged as a transformative force, increasingly integrated into diverse aspects of modern society, from healthcare and education to business and entertainment. Among the most influential AI technologies are large language models (LLMs), such as generative pretrained transformers (GPTs). These models process vast amounts of data and perform complex computations, enabling advanced natural language understanding and generation. However, deploying and operating such systems requires significant computational resources, leading to substantial energy consumption. While general-purpose hardware such as GPUs is limited by fixed-precision architectures, field-programmable gate arrays (FPGAs) offer the bit-level reconfigurability needed to exploit ultra-low-bitwidth representations. This reconfigurability allows power-intensive multiplications to be replaced by streamlined logic-based accumulations, maximizing the energy benefits of model quantization. This paper addresses the energy impact of LLMs by leveraging FPGA-based heterogeneous computing platforms. Results demonstrate that ternary matrix multiplication (MatMul) achieves a 23% speedup and a 96% reduction in digital signal processor (DSP) utilization. Furthermore, the final optimized design shows a 52% reduction in total energy consumption compared to the baseline, making heterogeneous computing a compelling solution for power- and resource-constrained embedded applications.
Monteiro et al. studied this question.
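The abstract's point that ultra-low-bitwidth weights let multiplications be replaced by accumulations can be illustrated in software. The following is a minimal sketch, assuming weights restricted to the ternary set {-1, 0, +1}; the function name `ternary_matmul` and the list-of-lists layout are hypothetical choices for illustration and are not taken from the paper, whose FPGA design is not described here.

```python
def ternary_matmul(activations, weights):
    """Compute activations (m x k) times weights (k x n) without multiplying.

    Assumption for this sketch: every weight is -1, 0, or +1, so each
    product reduces to adding, subtracting, or skipping an activation.
    """
    rows = len(activations)
    inner = len(weights)
    cols = len(weights[0]) if weights else 0
    out = [[0] * cols for _ in range(rows)]

    for i in range(rows):
        if len(activations[i]) != inner:
            raise ValueError("inner dimensions do not match")
        for k in range(inner):
            a = activations[i][k]
            for j in range(cols):
                w = weights[k][j]
                if w == 1:
                    out[i][j] += a      # +1 weight: accumulate
                elif w == -1:
                    out[i][j] -= a      # -1 weight: subtract
                elif w != 0:
                    raise ValueError("weights must be -1, 0, or +1")
                # 0 weight: contributes nothing, so it is skipped
    return out


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6]]
    w = [[1, 1], [-1, 0], [0, -1]]
    print(ternary_matmul(x, w))
```

Because the inner loop contains only additions, subtractions, and comparisons, a hardware counterpart of this loop would plausibly need adder logic rather than multiplier blocks, which is consistent with the reduced DSP utilization reported above.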