We introduce the SoftCap family, bounded rectifying activations derived from explicit continuity and sparsity constraints rather than empirical search @ramachandran2017searching. The family comprises SoftCap (C⁰), SwishCap (C¹), and SparseCap (C²), all sharing a bounded positive branch a (x) with an analytically derived, variance-preserving scalar a^* @glorot2010understanding; @he2015delving; @klambauer2017selu. In high-learning-rate grokking stress tests, SwishCap achieves 100% survival across all tested rates, whereas hard-zero variants exhibit sharp collapse boundaries, indicating that origin smoothness and negative-side gradient transport govern stability more strongly than boundedness alone @power2022grokking; @balduzzi2017shattered. Applied after the Q/K projections in Muon-trained ViTs, bounded activations lower peak pre-softmax attention scores by 3–4×, reducing reliance on explicit clamping @vaswani2017attention; @dosovitskiy2021vit. Under heavy-tailed contamination, they suppress outlier logit gaps by over two orders of magnitude, imposing an architectural confidence ceiling without explicit calibration @ovadia2019can; @guo2017calibration. Although the family trails ReLU/GELU by 4 pp in standard supervised regimes @nair2010relu; @hendrycks2016gelu, these results establish a constrained design map in which continuity order and notch geometry determine predictable trade-offs across stability, sparsity, and dynamic-range control.
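To make the idea of a bounded rectifier with a variance-preserving scale concrete, the following is a minimal sketch under stated assumptions, not the authors' definition. The abstract does not give the exact form of the positive branch, so this sketch assumes a tanh-shaped positive branch with a hard-zero negative side, and it assumes that "variance-preserving" means unit second moment under a standard normal input, in the spirit of He-style initialization arguments. The names `bounded_rectifier`, `second_moment_under_std_normal`, and `variance_preserving_scale` are hypothetical.

```python
import numpy as np


def bounded_rectifier(x, a):
    """Hypothetical bounded rectifier (an assumption, not the paper's formula):
    zero for x <= 0 and a * tanh(x) for x > 0, so outputs lie in [0, a)."""
    x = np.asarray(x, dtype=np.float64)
    return a * np.tanh(np.maximum(x, 0.0))


def second_moment_under_std_normal(a, half_width=10.0, n=200_001):
    """Approximate E[f(x)^2] for x ~ N(0, 1) with a Riemann sum on a uniform grid."""
    xs = np.linspace(-half_width, half_width, n)
    dx = xs[1] - xs[0]
    pdf = np.exp(-0.5 * xs**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(bounded_rectifier(xs, a) ** 2 * pdf) * dx)


def variance_preserving_scale():
    """Scale a* such that E[f(x)^2] = 1 (assumed criterion).

    The second moment is proportional to a^2, so a* = 1 / sqrt(m(1)),
    where m(1) is the second moment at a = 1.
    """
    return 1.0 / np.sqrt(second_moment_under_std_normal(1.0))


a_star = variance_preserving_scale()
```

Under these assumptions the output never exceeds `a_star`, which illustrates the kind of dynamic-range ceiling the abstract describes. The smoother family members (C¹ and C²) would need a different negative-side treatment that this sketch does not attempt to reproduce.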
Larry Cai
Australian Regenerative Medicine Institute
Jie Tang
Zhejiang International Studies University
Monash University
Australian Regenerative Medicine Institute
Cai et al. studied this question.
synapsesocial.com/papers/69a7ccf7d48f933b5eed8d9e — DOI: https://doi.org/10.5281/zenodo.18829083