March 3, 2026 · Open Access

Rethinking Differentiability: A Comparative Study of Smooth vs Non-Smooth Activation Functions in CNNs

Key points

ReLU achieved the highest average classification accuracy despite being non-differentiable at zero.
Leaky ReLU showed more stable learning behavior with reduced variance, pointing to a possible benefit for training stability.
Swish and Mish, though smooth and differentiable, did not outperform ReLU in classification accuracy during testing.
The findings favor empirical evaluation over mathematical assumptions when selecting activation functions.

Abstract

This study re-examines the role of differentiability and mathematical continuity in activation functions and experimentally investigates how these properties affect the performance of convolutional neural networks (CNNs). Although smooth, differentiable activation functions such as Swish and Mish have become prevalent in recent years, their contribution to learning performance remains unclear, particularly in shallow architectures. A controlled comparative study was conducted on the CIFAR-10 dataset with five common activation functions: ReLU, Leaky ReLU, Softplus, Swish, and Mish. Each function was trained three times with the same CNN architecture and training settings, and classification accuracy and training stability were analyzed together. ReLU, which is not differentiable at zero, achieved the highest average accuracy, while Leaky ReLU showed more stable learning behavior with reduced variance. Swish and Mish, which are smooth and differentiable, behaved consistently throughout training but did not deliver the anticipated accuracy advantage. Softplus performed worst, which the study attributes to its tendency to saturate. These findings suggest that differentiability and continuity, however appealing in theory, do not translate directly into better CNN performance in practice; the effectiveness of an activation function is shaped predominantly by the architecture and the learning dynamics. The study therefore argues for selecting activation functions on the basis of empirical evaluation rather than mathematical assumptions.
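
For readers who want the smooth versus non-smooth distinction in concrete terms, the following is a minimal sketch of the five activation functions using their standard textbook definitions. It is not the study's code: the Leaky ReLU slope of 0.01 and the Swish parameter beta = 1 are assumptions (common defaults), and the helper `one_sided_slopes` is a hypothetical name introduced here only to compare the left and right slopes at zero.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # negative_slope = 0.01 is an assumed common default, not taken from the study.
    return x if x >= 0 else negative_slope * x


def softplus(x: float) -> float:
    # log(1 + e^x), written in a form that does not overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    # beta = 1 is an assumption; this case is also known as SiLU.
    return x * sigmoid(beta * x)


def mish(x: float) -> float:
    return x * math.tanh(softplus(x))


def one_sided_slopes(f, x0: float = 0.0, h: float = 1e-6):
    """Hypothetical helper: finite-difference slopes just left and right of x0."""
    left = (f(x0) - f(x0 - h)) / h
    right = (f(x0 + h) - f(x0)) / h
    return left, right


ACTIVATIONS = {
    "ReLU": relu,
    "Leaky ReLU": leaky_relu,
    "Softplus": softplus,
    "Swish": swish,
    "Mish": mish,
}

if __name__ == "__main__":
    for name, f in ACTIVATIONS.items():
        left, right = one_sided_slopes(f)
        print(f"{name:10s}  left slope = {left:.4f}   right slope = {right:.4f}")
```

For ReLU and Leaky ReLU the two slopes at zero disagree (0 versus 1, and 0.01 versus 1), which is what non-differentiability at zero means; for Softplus, Swish, and Mish they agree up to finite-difference error. The sketch illustrates only the mathematical property under discussion and says nothing about accuracy, which the study measures empirically.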
