This paper presents TernarySSM, a novel language model architecture combining ternary {−1, 0, +1} weight quantization with selective state-space models (SSMs), sliding window attention (SWA), and Mixture-of-Depths (MoD) routing. The architecture achieves 2.6× weight compression at 50M scale with only ~7% quality degradation versus full-precision baselines, while enabling multiply-free inference through add/subtract/skip operations. Four key findings are validated: (1) ternary quantization, SSM parallel scan, and MoD routing compose additively with no interaction effects (the combined gap of +0.24 at 6M matches the predicted +0.23); (2) direct ternary training from random initialization matches progressive FP16-warmup training within 0.2%, eliminating the need for multi-stage quantization schedules; (3) the quality gap decreases with scale (+0.24 at 6M → +0.06 at 50M), consistent with BitNet literature predictions; (4) three independent gradient paths through the architecture guarantee O(1/√T) convergence with bounded quantization offset. On WikiText-103 at 72M parameters, the ternary model achieves test perplexity 36.23 versus 33.97 for the FP16 baseline (5 epochs), while a custom Triton inference kernel delivers 1.63× speedup with 79% VRAM reduction. A comprehensive ablation of eight training techniques for ternary SSMs provides the first systematic study of gradient estimation, quantization scheduling, and optimizer selection in this combined setting.
Julio Jose Lena (Sun,) studied this question.
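To make the "add/subtract/skip" claim in the abstract concrete, the following minimal sketch shows how a ternary weight matrix turns a matrix-vector product into additions and subtractions. It is an illustration under stated assumptions, not the paper's implementation: the absmean scaling rule is borrowed from the BitNet b1.58 line of work and may differ from the quantizer used in TernarySSM, and the function names `ternarize` and `ternary_matvec` are hypothetical.

```python
import numpy as np


def ternarize(w, eps=1e-8):
    """Map a real-valued weight matrix to {-1, 0, +1} plus one scalar scale.

    Assumption: absmean scaling (BitNet b1.58 style); the paper's exact
    quantizer is not given in the abstract.
    """
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.rint(w / scale), -1, 1).astype(np.int8)
    return q, scale


def ternary_matvec(q, scale, x):
    """Compute (scale * q) @ x without multiplying inside the dot products.

    For each output row: add the inputs whose weight is +1, subtract the
    inputs whose weight is -1, and skip the inputs whose weight is 0.
    The only multiplication left is one rescale per output element.
    """
    out = np.empty(q.shape[0], dtype=x.dtype)
    for i in range(q.shape[0]):
        row = q[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return scale * out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8))
    x = rng.standard_normal(8)
    q, scale = ternarize(w)
    # The multiply-free result should agree with an ordinary dense product.
    dense = (scale * q.astype(np.float64)) @ x
    print(np.allclose(ternary_matvec(q, scale, x), dense))
```

The Python loop is for readability only. A practical kernel, such as the custom Triton kernel mentioned in the abstract, would pack the ternary values and vectorize the accumulation.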