April 19, 2026 | Open Access

Beyond ReLU and GELU: SoftCap Bounded Activations for Stability, Sparsity, and Quantization

Key Points

The study develops a new family of bounded activation functions aimed at high-learning-rate stability, structural sparsity, and robustness to quantization.
It introduces the SoftCap family of activations: SoftCap, SwishCap, and SparseCap.
High-learning-rate stress tests evaluate the stability of each activation.
Bounded activations are applied in transformers to analyze their effect on attention scores and on performance under quantization.
The study also examines how bounded activations affect outlier suppression and structural sparsity.
SwishCap achieved 100% survival in the high-learning-rate stress tests, indicating strong stability.
Bounded activations reduced peak attention scores by 3–4× in transformers.
SparseCap routed 8.9% of queries to exactly zero, providing attention sparsity without post-hoc thresholding.
Fully bounded networks showed substantially less performance degradation under INT8 quantization than unbounded baselines.
Under heavy-tailed contamination, bounded activations suppressed outlier logit gaps by roughly 40–85×.
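
The quantization point above can be illustrated with a small, self-contained sketch. It is not the paper's experiment: it assumes symmetric per-tensor INT8 quantization, uses a hard clip at a hypothetical bound `cap` as a stand-in for a bounded activation, and injects one synthetic outlier. It only shows why a single large post-activation value inflates the quantization step for every other value.

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization; returns dequantized values."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def mean_abs_error(values):
    """Mean absolute round-trip error introduced by quantization."""
    dequantized = quantize_int8(values)
    return sum(abs(a - b) for a, b in zip(values, dequantized)) / len(values)

rng = random.Random(0)
typical = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Unbounded case: one synthetic outlier sets the scale for the whole tensor.
unbounded = typical + [200.0]

# Bounded case: hard clip at a hypothetical bound (stand-in, NOT SoftCap itself).
cap = 3.0
bounded = [max(-cap, min(cap, v)) for v in unbounded]

err_unbounded = mean_abs_error(unbounded)
err_bounded = mean_abs_error(bounded)
print(f"unbounded error: {err_unbounded:.4f}, bounded error: {err_bounded:.4f}")
```

With the outlier present, the quantization step is roughly 200/127, so most typical values collapse onto a handful of integer levels. With the bound in place the step is 3/127, and the round-trip error drops accordingly.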

Abstract

We introduce a constraints-first activation design framework: fix a shared bounded positive branch a (x) with (variance-preserving) scale parameter a and vary only the negative side, yielding the SoftCap family: a C⁰–C² progression of bounded rectifiers including SoftCap (exact-zero), SwishCap (C¹ derivative-matched), and SparseCap (C² quintic notch). These activations are derived in closed form, avoiding benchmark-driven empirical search [@ramachandran2017searching]. In high-learning-rate stress tests, SwishCap achieves 100% survival, demonstrating that negative-branch gradient flow dictates stability [@power2022grokking; @balduzzi2017shattered]. In transformers, applying bounded activations to query/key projections reduces peak attention scores by 3–4×, expands the high-learning-rate operating window, and reduces reliance on explicit clamping [@vaswani2017attention; @dosovitskiy2021vit]. Structurally, SparseCap natively routes 8.9% of queries to exactly zero, establishing an architectural foundation for attention sparsity without post-hoc thresholding. Beyond attention, fully bounded networks inherently suppress post-activation outliers, substantially reducing performance degradation under INT8 quantization compared to unbounded baselines. Finally, under heavy-tailed contamination, bounded variants suppress outlier logit gaps by ~40–85×, imposing a strict confidence ceiling without explicit calibration [@ovadia2019can; @guo2017calibration]. Together, these results establish a constrained design map in which continuity order and notch geometry determine predictable trade-offs across high-learning-rate stability, structural sparsity, and dynamic-range control.
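
As a rough illustration of the design described in the abstract, the sketch below builds a bounded rectifier whose negative side is exactly zero and shows how bounding query/key entries caps a scaled dot-product attention logit. The abstract does not give the closed form of the positive branch, so `a * tanh(x / a)` is used here purely as an assumed placeholder. The function names and the example vectors are likewise hypothetical, and the SwishCap and SparseCap negative branches are not reproduced.

```python
import math

def bounded_positive(x, a=1.0):
    """Placeholder bounded positive branch (ASSUMPTION: a * tanh(x / a))."""
    return a * math.tanh(x / a)

def exact_zero_rectifier(x, a=1.0):
    """Bounded rectifier whose negative side is exactly zero."""
    return bounded_positive(x, a) if x > 0.0 else 0.0

def scaled_logit(q, k):
    """Scaled dot-product attention logit: (q . k) / sqrt(d)."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))

a = 2.0  # hypothetical scale parameter
q_raw = [50.0, -3.0, 0.5, 12.0]  # made-up query projection
k_raw = [40.0, 7.0, -0.2, 9.0]   # made-up key projection

q_bounded = [exact_zero_rectifier(v, a) for v in q_raw]
k_bounded = [exact_zero_rectifier(v, a) for v in k_raw]

raw_logit = scaled_logit(q_raw, k_raw)
bounded_logit = scaled_logit(q_bounded, k_bounded)

# Every bounded entry lies in [0, a], so the logit cannot exceed a^2 * sqrt(d).
logit_ceiling = a * a * math.sqrt(len(q_raw))
print(raw_logit, bounded_logit, logit_ceiling)
```

Under these assumptions each entry of the bounded query and key lies in [0, a], so a sum of d products is at most d·a², and dividing by √d gives the ceiling a²√d regardless of how large the raw projections grow. This is one way to read the claim that bounded projections reduce reliance on explicit clamping.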
