What question did this study set out to answer?

The study asks how fast gradient descent converges when training overparameterized shallow neural networks with piecewise affine activation.

March 15, 2026

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

Key Points

The study analyzes how fast gradient descent converges when training overparameterized neural networks.
It considers fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit (ReLU).
The analysis covers randomly initialized gradient descent optimization in the overparameterized regime.
It examines how network width and learning rate affect the convergence rate.
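With high probability over the random initialization, batch gradient descent drives the mean squared training error to zero, provided the activation is not affine and the training inputs are pairwise distinct.
The convergence is linear as long as the network is wide enough and the learning rate is small enough.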
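
In symbols, the setting and the notion of a linear rate can be sketched as follows. The notation here (m for the width, γ for the learning rate, ρ for the contraction factor, and the particular parameterization of the network) is assumed for illustration; it is not the paper's exact theorem statement, whose constants and precise conditions are not reproduced here.

```latex
% Assumed notation: width m, activation \sigma (piecewise affine, not affine),
% training data (x_i, y_i), i = 1, ..., n, with pairwise distinct inputs x_i.
\[
  \mathcal{N}_\theta(x) = \sum_{j=1}^{m} v_j \, \sigma\bigl(\langle w_j, x \rangle + b_j\bigr),
  \qquad
  \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\mathcal{N}_\theta(x_i) - y_i\bigr)^2 .
\]
% Batch gradient descent with learning rate \gamma > 0. Because \sigma is only
% piecewise affine, \nabla is understood as a generalized gradient at kinks.
\[
  \theta_{k+1} = \theta_k - \gamma \, \nabla \mathcal{L}(\theta_k),
  \qquad k = 0, 1, 2, \dots
\]
% "Linear rate" means geometric decay of the loss for some \rho \in (0, 1):
\[
  \mathcal{L}(\theta_k) \le (1 - \rho)^k \, \mathcal{L}(\theta_0) .
\]
```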

Abstract

In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why randomly initialized gradient descent optimization algorithms, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this problem in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article we provide a further contribution to this area of research by considering overparameterized fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit activation. Specifically, given that the activation function is not affine and the training input data are pairwise distinct, we show that, with high probability, the mean squared error using batch gradient descent optimization applied to such a randomly initialized artificial neural network converges to zero at a linear convergence rate as long as the width of the artificial neural network is large enough and the learning rate is small enough.
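
The behavior described in the abstract can be observed numerically on a toy problem. The following is a minimal sketch, not the authors' experimental setup: the function names, the Gaussian initialization, the 1/sqrt(width) output scaling, the synthetic data, and the default width, learning rate, and step count are all assumptions chosen for illustration.

```python
import numpy as np


def make_data(n=8, d=10, seed=1):
    """Synthetic data (an assumption): n random unit-norm inputs, random targets.

    Random continuous inputs are pairwise distinct with probability one.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y = rng.normal(size=n)
    return x, y


def train_shallow_relu(x, y, width=1024, lr=0.2, steps=1000, seed=0):
    """Full-batch gradient descent on a shallow ReLU network.

    Returns the mean squared error recorded before each update.
    Initialization and output scaling are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = rng.normal(size=(width, d))            # inner weights
    b = rng.normal(size=width)                 # inner biases
    v = rng.choice([-1.0, 1.0], size=width)    # outer weights
    scale = 1.0 / np.sqrt(width)               # assumed output scaling

    losses = np.empty(steps)
    for k in range(steps):
        pre = x @ W.T + b                      # pre-activations, shape (n, width)
        act = np.maximum(pre, 0.0)             # ReLU
        resid = scale * (act @ v) - y          # prediction errors, shape (n,)
        losses[k] = np.mean(resid ** 2)

        # Gradient of the mean squared error with respect to each parameter.
        g_pred = 2.0 * resid / n
        g_v = scale * (act.T @ g_pred)
        g_pre = scale * np.outer(g_pred, v) * (pre > 0)
        g_W = g_pre.T @ x
        g_b = g_pre.sum(axis=0)

        W -= lr * g_W
        b -= lr * g_b
        v -= lr * g_v
    return losses
```

Calling `train_shallow_relu(*make_data())` returns the loss history. If convergence is linear, plotting the logarithm of the loss against the step index should give a roughly straight, downward-sloping line; shrinking `width` or raising `lr` well beyond these defaults is a simple way to probe where that behavior breaks down.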
