What does this research mean for the field?

TernarySSM achieves 2.6× weight compression with only ∼7% quality degradation relative to full-precision baselines, while enabling multiply-free inference.

What question did this study set out to answer?

This work aims to develop a hardware-efficient language model utilizing ternary weights for better scalability and performance.

February 24, 2026 (Open Access)

Ternary State-Space Models: Hardware-Efficient Language Modeling with -1, 0, +1 Weights

Key Points

Introduced the TernarySSM architecture with ternary weights {-1, 0, +1}.
Combined selective state-space models (SSMs), sliding window attention (SWA), and Mixture-of-Depths (MoD) routing.
Ran a systematic ablation of eight training techniques covering gradient estimation, quantization scheduling, and optimizer selection.
Achieved 2.6× weight compression at 50M scale with only ∼7% quality degradation.
Reached test perplexity 36.23 on WikiText-103 versus 33.97 for the FP16 baseline.
Delivered 1.63× inference speedup with 79% VRAM reduction using a custom Triton kernel.
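
As a quick check on how the quoted perplexities relate to the ∼7% degradation figure, the relative gap can be computed directly from the two reported values. This is only a sketch: the variable names are illustrative, and it assumes the ∼7% figure is the relative perplexity increase, which the summary does not state explicitly.

```python
# Test perplexities quoted in the summary.
ppl_ternary = 36.23
ppl_fp16 = 33.97

# Relative increase in perplexity of the ternary model over the FP16 baseline.
# Assumption: this is what the "~7% quality degradation" figure measures.
relative_gap = (ppl_ternary - ppl_fp16) / ppl_fp16

print(f"relative perplexity gap: {relative_gap:.1%}")  # prints 6.7%
```

Under that assumption the gap is about 6.7%, in line with the ∼7% quoted above.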

Abstract

This paper presents TernarySSM, a novel language model architecture combining ternary {−1, 0, +1} weight quantization with selective state-space models (SSMs), sliding window attention (SWA), and Mixture-of-Depths (MoD) routing. The architecture achieves 2.6× weight compression at 50M scale with only ∼7% quality degradation versus full-precision baselines, while enabling multiply-free inference through add/subtract/skip operations. Four key findings are validated: (1) ternary quantization, SSM parallel scan, and MoD routing compose additively with no interaction effects (combined gap +0.24 matches predicted +0.23 at 6M); (2) direct ternary training from random initialization matches progressive FP16-warmup training within 0.2%, eliminating the need for multi-stage quantization schedules; (3) the quality gap decreases with scale (+0.24 at 6M → +0.06 at 50M), consistent with BitNet literature predictions; (4) three independent gradient paths through the architecture guarantee O(1/√T) convergence with bounded quantization offset. On WikiText-103 at 72M parameters, the ternary model achieves test perplexity 36.23 versus 33.97 for the FP16 baseline (5 epochs), while a custom Triton inference kernel delivers 1.63× speedup with 79% VRAM reduction. A comprehensive ablation of eight training techniques for ternary SSMs provides the first systematic study of gradient estimation, quantization scheduling, and optimizer selection in this combined setting.
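
To make "multiply-free inference through add/subtract/skip operations" concrete, here is a minimal pure-Python sketch of a ternary dot product. It is not the paper's implementation. The absmean scaling rule is an assumption borrowed from BitNet-style quantizers, since the abstract does not say how weights are ternarized. The function names `ternarize` and `ternary_dot` and the sample values are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one float scale.

    Assumption: absmean scaling (divide by mean |w|, round, clip).
    The abstract does not specify the quantizer actually used.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def ternary_dot(ternary, activations):
    """Dot product against ternary weights using only add, subtract, and skip."""
    acc = 0.0
    for t, x in zip(ternary, activations):
        if t == 1:
            acc += x      # weight +1: add the activation
        elif t == -1:
            acc -= x      # weight -1: subtract the activation
        # weight 0: skip, no work done
    return acc


# Illustrative values only, not taken from the paper.
weights = [0.9, -0.05, -1.2, 0.4, 0.0, -0.7]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

ternary, scale = ternarize(weights)
# In this sketch the only multiplication is one rescale per output value.
y = scale * ternary_dot(ternary, activations)
print(ternary, y)
```

In this sketch the inner loop needs no multiplications, and zero weights cost nothing. The single rescale per output is a property of the assumed absmean scheme and may not match how the paper's Triton kernel handles scaling.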
