This paper presents a rigorous mathematical analysis of optimization algorithms central to deep learning: Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, Adam, and AMSGrad. We compare the update rules of each algorithm and examine the mathematical techniques behind them, such as Taylor expansions for approximating loss functions and gradients, and dynamical systems theory for understanding acceleration. We prove convergence under standard assumptions, including convexity, smoothness (Lipschitz continuity of gradients), and strong convexity. We also analyze rates of convergence in several settings, such as O(1/t) for GD on convex, smooth functions and O(1/√t) for stochastic methods in non-convex settings. We further consider the impact of bounded gradients in stochastic settings and the use of Lyapunov functions for proving convergence. Through this analysis, we aim to bridge the gap between theory and practice, offering insights into the design and application of optimization algorithms in deep learning.
Essang et al. (Tue,) studied this question.
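
To make the update rules named in the abstract concrete, the following is a minimal sketch in their commonly stated textbook forms, not necessarily the exact formulations analyzed in the paper. The quadratic test loss f(x) = 0.5 * ||x||^2, the function names, and all hyperparameter values are illustrative assumptions. SGD is omitted because it shares the GD update with a stochastic gradient estimate in place of the exact gradient.

```python
import numpy as np

def grad(x):
    # Gradient of the illustrative loss f(x) = 0.5 * ||x||^2 (an assumption for this sketch).
    return x

def gd(x, steps, lr=0.1):
    # Gradient Descent: x <- x - lr * grad f(x)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def momentum(x, steps, lr=0.1, beta=0.9):
    # Heavy-ball momentum: v accumulates past gradients, x moves along v.
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)
        x = x - lr * v
    return x

def adam(x, steps, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, amsgrad=False):
    # Adam keeps exponential moving averages of the gradient (m) and its square (v).
    # With amsgrad=True, the step is scaled by the running maximum of v instead,
    # and bias correction is skipped (one common way AMSGrad is stated).
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    v_max = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        if amsgrad:
            v_max = np.maximum(v_max, v)
            x = x - lr * m / (np.sqrt(v_max) + eps)
        else:
            m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
            v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
            x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([1.0, -2.0])
print("GD:      ", gd(x0, 100))
print("Momentum:", momentum(x0, 300))
print("Adam:    ", adam(x0, 1))
print("AMSGrad: ", adam(x0, 1, amsgrad=True))
```

On this toy loss, GD contracts the iterate by a factor of (1 - lr) per step, while the first Adam step has magnitude close to lr in every coordinate regardless of the gradient scale, which is one way to see the per-coordinate normalization at work.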