Abstract

This work proposes a unified neuromorphic spike-based LLM (NSLLM) framework that addresses two challenges of large language models (LLMs) at once: high energy consumption and low interpretability. The framework transforms LLMs into efficient NSLLMs by converting their behaviors into neural dynamics, such as spike trains, through rigorous mathematical modeling complemented by quantization and sparsification. This transformation also makes it possible to analyze information encoding with computational neuroscience tools, offering a neuroscientific perspective that treats LLMs as neural populations in order to enhance their interpretability. Leveraging a hardware-algorithm co-design paradigm, NSLLM can completely eliminate matrix multiplication (MatMul) while maintaining high performance. We designed a custom MatMul-free hardware core on the VCK190 FPGA to validate a 1.5-billion-parameter NSLLM, achieving a dynamic power consumption of only 13.849 watts and an inference throughput of 161.8 tokens per second. Compared with the A800 GPU, this implementation improves energy efficiency, memory usage, and inference throughput by 19.8×, 21.3×, and 2.2×, respectively. This work provides a unified framework for enhancing both the energy efficiency and the interpretability of LLMs, and offers insights for future neuromorphic chip designs tailored to large models.
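To make the MatMul-free claim more concrete, the following is a minimal sketch of one way spike-coded activations can turn a matrix-vector product into additions only. It is an illustration under stated assumptions, not the authors' implementation: the unary (rate-like) spike encoding, the function names `to_spike_train`, `spike_accumulate`, and `dense_matvec`, and the toy weights are all hypothetical.

```python
def to_spike_train(counts, num_steps):
    """Expand non-negative integer activations into binary spike vectors.

    Assumed encoding: an activation k becomes k spikes followed by
    (num_steps - k) silent steps. This is an illustrative choice only.
    """
    for k in counts:
        if not 0 <= k <= num_steps:
            raise ValueError("activation outside the representable range")
    return [[1 if t < k else 0 for k in counts] for t in range(num_steps)]


def spike_accumulate(weights, spikes):
    """Compute the matrix-vector product using additions only.

    weights: list of rows; spikes: one binary vector per timestep.
    """
    out = [0] * len(weights)
    for step in spikes:
        for j, fired in enumerate(step):
            if fired:  # silent inputs are skipped, so sparsity saves work
                for i, row in enumerate(weights):
                    out[i] += row[j]  # accumulate; no multiplication
    return out


def dense_matvec(weights, x):
    """Reference multiply-accumulate version, for comparison."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


weights = [[2, -1, 3], [0, 4, -2]]  # toy integer (quantized) weights
counts = [3, 0, 1]                  # toy integer activations
spikes = to_spike_train(counts, num_steps=4)
result = spike_accumulate(weights, spikes)
print(result, result == dense_matvec(weights, counts))
```

In this sketch the second input never fires, so its weight column is never read. This is the sense in which sparsification reduces work in an accumulate-only datapath.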
Xu et al. studied this question.