We introduce a constraints-first activation design framework: fix a shared bounded positive branch a (x) with (variance-preserving) scale parameter a and vary only the negative side. This yields the SoftCap family, a C⁰–C² progression of bounded rectifiers including SoftCap (exact-zero), SwishCap (C¹ derivative-matched), and SparseCap (C² quintic notch). These activations are derived in closed form, avoiding benchmark-driven empirical search @ramachandran2017searching. In high-learning-rate stress tests, SwishCap achieves 100% survival, demonstrating that negative-branch gradient flow dictates stability @power2022grokking; @balduzzi2017shattered. In transformers, applying bounded activations to query/key projections reduces peak attention scores by 3–4×, expands the high-learning-rate operating window, and reduces reliance on explicit clamping @vaswani2017attention; @dosovitskiy2021vit. Structurally, SparseCap natively routes 8.9% of queries to exactly zero, establishing an architectural foundation for attention sparsity without post-hoc thresholding. Beyond attention, fully bounded networks inherently suppress post-activation outliers, substantially reducing performance degradation under INT8 quantization compared to unbounded baselines. Finally, under heavy-tailed contamination, bounded variants suppress outlier logit gaps by ~40–85×, imposing a strict confidence ceiling without explicit calibration @ovadia2019can; @guo2017calibration. Together, these results establish a constrained design map in which continuity order and notch geometry determine predictable trade-offs across high-learning-rate stability, structural sparsity, and dynamic-range control.
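The claim about query/key projections rests on a simple bound: if every entry of a query and a key is bounded in magnitude by a, their scaled dot product cannot exceed a²·√d. The sketch below illustrates only that bound. It assumes a hypothetical stand-in activation, `a * tanh(x / a)`, which is not the SoftCap definition (the abstract does not give the closed forms), and the names `bounded_stand_in` and `peak_attention_logit`, the heavy-tailed test data, and the values of `n`, `d`, and `a` are all illustrative assumptions.

```python
import numpy as np


def bounded_stand_in(x, a=2.0):
    """Hypothetical bounded activation a*tanh(x/a); NOT the paper's SoftCap."""
    return a * np.tanh(x / a)


def peak_attention_logit(q, k):
    """Largest absolute scaled dot-product score |q_i . k_j| / sqrt(d)."""
    d = q.shape[-1]
    return np.abs(q @ k.T).max() / np.sqrt(d)


rng = np.random.default_rng(0)
n, d, a = 64, 32, 2.0  # illustrative sizes and scale, not taken from the paper

# Heavy-tailed samples stand in for query/key projections with outliers.
q = rng.standard_t(df=2, size=(n, d))
k = rng.standard_t(df=2, size=(n, d))

raw_peak = peak_attention_logit(q, k)
capped_peak = peak_attention_logit(bounded_stand_in(q, a), bounded_stand_in(k, a))

# Every entry is bounded by a, so |q . k| <= d * a^2 and the scaled score
# is at most a^2 * sqrt(d), regardless of how heavy the input tails are.
ceiling = a * a * np.sqrt(d)

print(f"raw peak logit:    {raw_peak:.2f}")
print(f"capped peak logit: {capped_peak:.2f}")
print(f"hard ceiling:      {ceiling:.2f}")
```

The ceiling holds for any activation bounded by a, so it says nothing about which negative-branch design is preferable; it only shows why bounding both projections limits peak attention scores without an explicit clamp.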
Larry Cai
Australian Regenerative Medicine Institute
Jie TANG
Monash University
Australian Regenerative Medicine Institute
synapsesocial.com/papers/69e4739a010ef96374d8f55f — DOI: https://doi.org/10.5281/zenodo.19622248