We introduce the SoftCap family, a set of bounded rectifying activations derived from explicit continuity and sparsity constraints rather than from empirical search @ramachandran2017searching. The family comprises SoftCap (C⁰), SwishCap (C¹), and SparseCap (C²), all sharing a bounded positive branch a (x) with an analytically derived, variance-preserving scalar a^* @glorot2010understanding; @he2015delving; @klambauer2017selu. In high-learning-rate grokking stress tests, SwishCap achieves 100% survival across all tested rates, whereas hard-zero variants exhibit sharp collapse boundaries, indicating that smoothness at the origin and negative-side gradient transport govern stability more strongly than boundedness alone @power2022grokking; @balduzzi2017shattered. Applied after the Q/K projections in Muon-trained ViTs, bounded activations reduce peak pre-softmax attention scores by 3–4×, which reduces reliance on explicit clamping @vaswani2017attention; @dosovitskiy2021vit. Under heavy-tailed contamination, they suppress outlier logit gaps by over two orders of magnitude, imposing an architectural confidence ceiling without explicit calibration @ovadia2019can; @guo2017calibration. Although the family trails ReLU/GELU by 4 pp in standard supervised regimes @nair2010relu; @hendrycks2016gelu, these results establish a constrained design map in which continuity order and notch geometry determine predictable trade-offs across stability, sparsity, and dynamic-range control.
Cai et al. (Sun,) studied this question.