We introduce a constraints-first framework for activation design: fix a shared bounded positive branch a (x) with (variance-preserving) scale parameter a, and vary only the negative side. This yields the SoftCap family, a C⁰–C² progression of bounded rectifiers comprising SoftCap (exact-zero), SwishCap (C¹ derivative-matched), and SparseCap (C² quintic notch). These activations are derived in closed form rather than found by benchmark-driven empirical search @ramachandran2017searching. In high-learning-rate stress tests, SwishCap achieves 100% survival, indicating that negative-branch gradient flow governs stability @power2022grokking; @balduzzi2017shattered. In transformers, applying bounded activations to the query/key projections reduces peak attention scores by 3–4×, widens the high-learning-rate operating window, and reduces reliance on explicit clamping @vaswani2017attention; @dosovitskiy2021vit. Structurally, SparseCap natively routes 8.9% of queries to exactly zero, providing an architectural foundation for attention sparsity without post-hoc thresholding. Beyond attention, fully bounded networks inherently suppress post-activation outliers, substantially reducing performance degradation under INT8 quantization relative to unbounded baselines. Finally, under heavy-tailed contamination, bounded variants suppress outlier logit gaps by ~40–85×, imposing a strict confidence ceiling without explicit calibration @ovadia2019can; @guo2017calibration. Together, these results establish a constrained design map in which continuity order and notch geometry determine predictable trade-offs among high-learning-rate stability, structural sparsity, and dynamic-range control.
Cai et al. (Fri,) studied this question.
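The claim that bounding the query/key projections limits peak attention scores can be illustrated with a minimal Python sketch. It does not use the paper's activations: `capped_relu` is a hypothetical stand-in (exact zero for non-positive inputs, `a * tanh(x / a)` on the positive side), and the sizes, the scale `a`, and the input spread are arbitrary assumptions. The only property it relies on is that every output lies in [0, a], which caps any scaled dot-product score at a²·√d.

```python
import math
import random


def capped_relu(x: float, a: float = 2.0) -> float:
    """Hypothetical bounded rectifier (an assumption, not the paper's definition):
    exactly zero for x <= 0, a * tanh(x / a) for x > 0, so outputs lie in [0, a]."""
    return a * math.tanh(x / a) if x > 0.0 else 0.0


def peak_score(q_rows, k_rows):
    """Largest |q . k| / sqrt(d) over all query/key pairs."""
    d = len(q_rows[0])
    return max(
        abs(sum(qi * ki for qi, ki in zip(q, k))) / math.sqrt(d)
        for q in q_rows
        for k in k_rows
    )


def demo(n=16, d=32, a=2.0, scale=4.0, seed=0):
    """Compare peak scores of raw vs. bounded random projections (toy data)."""
    rng = random.Random(seed)
    q = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n)]
    k = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n)]

    raw = peak_score(q, k)

    # Apply the bounded activation elementwise to queries and keys.
    q_b = [[capped_relu(v, a) for v in row] for row in q]
    k_b = [[capped_relu(v, a) for v in row] for row in k]
    bounded = peak_score(q_b, k_b)

    # Each entry is in [0, a], so q . k <= d * a^2 and the score <= a^2 * sqrt(d).
    ceiling = a * a * math.sqrt(d)
    return raw, bounded, ceiling


if __name__ == "__main__":
    raw, bounded, ceiling = demo()
    print(f"raw peak: {raw:.2f}  bounded peak: {bounded:.2f}  ceiling: {ceiling:.2f}")
```

The raw peak grows with the spread of the inputs, while the bounded peak can never exceed the ceiling regardless of input scale. The sketch is not intended to reproduce the reported 3–4× reduction, which depends on the trained models and the actual activation definitions.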