What does this research mean for the field?

The SoftCap family of bounded rectifying activations improves stability and sparsity in neural networks compared with standard activations such as ReLU and GELU. Novelty: novel finding. Consensus alignment: challenges consensus.

What question did this study set out to answer?

This research introduces the SoftCap family of bounded activations aimed at enhancing neural network performance through improved stability and sparsity.

March 4, 2026 · Open Access

Beyond ReLU and GELU: SoftCap Bounded Activations for Stability and Sparsity

Key Points

Developed the SoftCap, SwishCap, and SparseCap activation functions with explicit continuity and sparsity constraints.
Ran high-learning-rate grokking stress tests to probe training stability.
Applied bounded activations in ViT (Vision Transformer) models after Q/K projections.
SwishCap achieved 100% survival across all tested learning rates in these stress tests, whereas hard-zero variants showed sharp collapse boundaries.
Reduced peak pre-softmax attention scores by 3–4 times in ViT applications.
Suppressed outlier logit gaps by over 100 times under heavy-tailed contamination.
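
The point about Q/K projections can be illustrated with a small sketch. If every entry of a query and a key lies in [0, a], their scaled dot product cannot exceed a² · sqrt(d), so the pre-softmax score has a hard ceiling regardless of how large the raw projections grow. The function `bounded_rectifier` below is a hypothetical stand-in (a rectified tanh), not the paper's SoftCap definition, and the dimensions, scale, and seed are arbitrary assumptions. The sketch shows only the mechanism and does not attempt to reproduce the reported 3–4× reduction.

```python
import math
import random


def bounded_rectifier(x, a=1.0):
    """Hypothetical stand-in, NOT the paper's SoftCap: 0 for x <= 0, saturates below a."""
    return a * math.tanh(max(x, 0.0) / a)


def peak_score(queries, keys):
    """Largest absolute scaled dot-product score over all query/key pairs."""
    d = len(queries[0])
    return max(
        abs(sum(qi * ki for qi, ki in zip(q, k))) / math.sqrt(d)
        for q in queries
        for k in keys
    )


def make_vectors(rng, n, d, scale):
    """n random d-dimensional vectors with Gaussian entries (assumed toy data)."""
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d, n, a = 16, 8, 1.0

# Deliberately large-magnitude "projections" to mimic score blow-up.
q = make_vectors(rng, n, d, scale=5.0)
k = make_vectors(rng, n, d, scale=5.0)

# Apply the bounded activation element-wise after the (toy) Q/K projections.
q_capped = [[bounded_rectifier(v, a) for v in row] for row in q]
k_capped = [[bounded_rectifier(v, a) for v in row] for row in k]

raw_peak = peak_score(q, k)
capped_peak = peak_score(q_capped, k_capped)
ceiling = a * a * math.sqrt(d)  # analytic upper bound on any capped score

print(f"raw peak score:    {raw_peak:.2f}")
print(f"capped peak score: {capped_peak:.2f} (ceiling {ceiling:.2f})")
```

Because the ceiling depends only on the cap `a` and the head dimension `d`, it holds for any input scale, which is why a bounded activation can reduce the need for explicit score clamping.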

Abstract

We introduce the SoftCap family, bounded rectifying activations derived from explicit continuity and sparsity constraints rather than empirical search @ramachandran2017searching. The family comprises SoftCap (C⁰), SwishCap (C¹), and SparseCap (C²), all sharing a bounded positive branch a (x) with analytically derived, variance-preserving scalar a^* @glorot2010understanding; @he2015delving; @klambauer2017selu. In high-learning-rate grokking stress tests, SwishCap achieves 100% survival across all tested rates, whereas hard-zero variants exhibit sharp collapse boundaries, indicating that origin smoothness and negative-side gradient transport govern stability more strongly than boundedness alone @power2022grokking; @balduzzi2017shattered. Applied after Q/K projections in Muon-trained ViTs, bounded activations reduce peak pre-softmax attention scores by 3–4×, reducing reliance on explicit clamping @vaswani2017attention; @dosovitskiy2021vit. Under heavy-tailed contamination, they suppress outlier logit gaps by over two orders of magnitude, imposing an architectural confidence ceiling without explicit calibration @ovadia2019can; @guo2017calibration. While trailing ReLU/GELU by 4 pp in standard supervised regimes @nair2010relu; @hendrycks2016gelu, these results establish a constrained design map in which continuity order and notch geometry determine predictable trade-offs across stability, sparsity, and dynamic-range control.
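
As a rough illustration of what a variance-preserving scalar can mean, the sketch below computes the gain that makes an activation's output second moment equal to 1 for standard normal inputs, in the spirit of the He initialization analysis the abstract cites. This is one common reading of the term and is an assumption here, since the abstract does not state the paper's derivation or the exact role of a^*. The function `bounded_stand_in` is again a hypothetical rectified tanh rather than the paper's activation, and the integral is evaluated numerically rather than analytically.

```python
import math


def second_moment(f, lo=-8.0, hi=8.0, steps=20000):
    """E[f(z)^2] for z ~ N(0, 1), via a midpoint-rule integral over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += f(z) ** 2 * pdf * h
    return total


def variance_preserving_gain(f):
    """Scalar g such that E[(g * f(z))^2] = 1 under standard normal input."""
    return 1.0 / math.sqrt(second_moment(f))


def relu(z):
    return max(z, 0.0)


def bounded_stand_in(z):
    """Hypothetical bounded rectifier, NOT the paper's SoftCap."""
    return math.tanh(max(z, 0.0))


relu_gain = variance_preserving_gain(relu)
bounded_gain = variance_preserving_gain(bounded_stand_in)

print(f"ReLU gain:             {relu_gain:.4f}")
print(f"bounded stand-in gain: {bounded_gain:.4f}")
```

For ReLU this procedure recovers the familiar gain of sqrt(2). The bounded stand-in needs a larger gain because saturation shrinks its outputs relative to ReLU.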
