What question did this study set out to answer?

The aim is to accelerate inference of large language models on edge devices using ternary weights and specialized algorithms.

April 19, 2026

Open Access

Efficient Addition-Based Sparse GEMM for Fast Ternary Large Language Model Inference on Edge Devices

Key Points

Proposed an efficient ternary sparse data format for storing non-zero indices.
Designed a novel ternary GEMM algorithm utilizing sparse addition instead of multiplication.
Implemented optimizations on x86 CPUs and Nvidia GPUs for performance enhancement.
Achieved a theoretical speedup of 4× over dense GEMM with weights having 50% sparsity.
Achieved speedups of 1.3-3.9× over Eigen Sparse GEMM and 3.3-6.9× over PyTorch Sparse GEMM.
GPU implementation delivers up to 22 tokens/s for Llama-3 3B models on RTX-3080Ti.
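
The summary does not show where the 4× theoretical figure comes from. One plausible reading, offered here as an assumption and not as the paper's own derivation, counts each multiplication and each addition as one operation. The symbols M, K, N (matrix dimensions) and s (fraction of zero weights) are introduced only for this illustration.

```latex
% Dense GEMM, X (M x K) times W (K x N): one multiply and one add per weight
\text{ops}_{\text{dense}} = 2MNK
% Ternary GEMM: one add or subtract per non-zero weight, none for zeros
\text{ops}_{\text{ternary}} = (1 - s)\,MNK
% Ratio of the two counts
\frac{\text{ops}_{\text{dense}}}{\text{ops}_{\text{ternary}}} = \frac{2}{1 - s}
\quad\Longrightarrow\quad
\frac{2}{1 - 0.5} = 4 \quad \text{at } s = 0.5
```

Under this counting, the speedup comes half from dropping multiplications and half from skipping zero weights. Measured speedups depend on memory access and hardware details that this count ignores.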

Abstract

Large Language Models (LLMs) are the new dominant application but suffer from high memory and computational cost. Ternary LLMs have been proposed for easier deployment on edge platforms, as 2-bit ternary weights {-1, 0, +1} can reduce the model size by 16× compared to FP32 representations. In addition, ternary General Matrix Multiplication (GEMM) can reduce the computational complexity by performing addition and subtraction operations with non-zero weights only. However, existing CPUs and GPUs do not support native 2-bit operations, and existing libraries like PyTorch and CUDA do not have dedicated computing kernels for ternary weights. Moreover, existing sparse formats like Compressed Sparse Column are not optimized for ternary values, causing extra storage and decompression overhead. In this paper, we accelerate ternary LLMs on edge devices through efficient data formats and specialized computing kernels. We propose an efficient ternary sparse data format that stores only the indices of non-zero values and simplifies decompression at runtime. We also design a novel ternary GEMM algorithm that performs sparse addition on activations instead of multiplication to reduce the computational complexity. It achieves a 4× theoretical speedup over dense GEMM with 50% sparsity in weights. We have implemented these algorithms and optimized computing kernels on both x86 CPUs and Nvidia GPUs. Evaluation results show that they achieve 1.3-3.9× speedup over Eigen Sparse GEMM, 3.3-6.9× speedup over PyTorch Sparse GEMM, and around 5.5× speedup over cuSPARSE. The GPU implementation can serve Llama-3 3B and 8B models on an RTX-3080Ti at 22 and 7 tokens/s, respectively, while the full-precision versions run out of memory.
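
To make the idea of an index-only format and an addition-based GEMM concrete, here is a minimal pure-Python sketch. It is not the paper's data format or kernel: the per-column layout (separate index lists for +1 and -1 weights) and the names `to_ternary_index_format`, `ternary_gemm`, and `dense_gemm` are assumptions made for illustration. It shows only that the product can be computed without multiplications and without storing or touching zero weights.

```python
import random

def to_ternary_index_format(W):
    """Split a K x N ternary matrix into per-column index lists.

    Hypothetical layout (not the paper's): pos[j] and neg[j] hold the row
    indices k where W[k][j] is +1 and -1. Zero weights are not stored.
    """
    K, N = len(W), len(W[0])
    pos = [[k for k in range(K) if W[k][j] == 1] for j in range(N)]
    neg = [[k for k in range(K) if W[k][j] == -1] for j in range(N)]
    return pos, neg

def ternary_gemm(X, pos, neg):
    """Compute Y = X @ W using only additions and subtractions of activations."""
    Y = []
    for row in X:
        out = []
        for p_idx, n_idx in zip(pos, neg):
            acc = 0.0
            for k in p_idx:   # weight +1: add the activation
                acc += row[k]
            for k in n_idx:   # weight -1: subtract the activation
                acc -= row[k]
            out.append(acc)
        Y.append(out)
    return Y

def dense_gemm(X, W):
    """Reference dense product with one multiply-add per weight."""
    N = len(W[0])
    return [[sum(x * W[k][j] for k, x in enumerate(row)) for j in range(N)]
            for row in X]

# Small example: about half of the weights are zero.
random.seed(0)
M, K, N = 4, 16, 8
X = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]
W = [[random.choice([-1, 0, 0, 1]) for _ in range(N)] for _ in range(K)]

pos, neg = to_ternary_index_format(W)
Y = ternary_gemm(X, pos, neg)
Y_ref = dense_gemm(X, W)
```

Here `Y` and `Y_ref` agree up to floating-point rounding. Because the sign is implied by which list an index belongs to, the lists carry no value array. A production kernel would pack the indices compactly and vectorize the accumulation, which this sketch does not attempt.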
