While unbounded activations like ReLU and SiLU drive modern architectures, their lack of geometric constraints necessitates compensatory architectural measures to mitigate failure modes in stability, quantization, and implicit modeling. We address this with a constraints-first design framework that yields the **SoftCap family**, a C⁰-C² progression of bounded rectifiers usable as drop-in replacements: **SoftCap** (exact-zero), **SwishCap** (C¹ derivative-matched), and **SparseCap** (C² quintic notch). Because the family is derived in closed form and anchored by variance-preserving initialization, principled forward constraints replace benchmark-driven empirical search. In grokking stress tests, SwishCap achieves 100% survivorship across all 16 aggressive configurations, supporting the hypothesis that origin-adjacent recovery geometry and tight forward scale jointly expand the safe operating region. In transformers, the SoftCap family remains stable at learning-rate multipliers of **up to 80** where standard controls collapse, while providing perplexity gains over standard GELU baselines and reducing peak attention scores by 3--4, decreasing reliance on explicit clamping. Concurrently, bounded FFNs suppress post-activation outliers by 15 and reduce INT8 quantization-induced perplexity degradation by **over 25%**. Across heavy-tailed OOD shifts, the same bounded geometry compresses outlier logit gaps by up to 85. SparseCap natively generates 8.9% structural sparsity in NanoGPT query activations, establishing the mathematical foundation for sparse attention without post-hoc thresholding. Finally, in energy-based models (EBMs), bounded geometry provides an architectural alternative to objective-level landscape regularization: the SoftCap family suppresses score-tail excursions, reduces spurious drift, and raises the high-fidelity sampling ceiling.
Together, these findings support geometric activation bounds as a shared mechanism for regulating failure modes across explicit and implicit architectures, offering a unified framework for robust model design.
Larry Cai
Australian Regenerative Medicine Institute
Jie Tang
Australian Regenerative Medicine Institute
Monash University
Australian Regenerative Medicine Institute
Cai et al. (Fri,) studied this question.
synapsesocial.com/papers/69f6e62e8071d4f1bdfc6bed — DOI: https://doi.org/10.5281/zenodo.19924814