Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Key Points

Adan (ADAptive Nesterov momentum algorithm) is an optimizer intended to speed up training consistently across different kinds of deep networks, reducing the need to pick an optimizer for each architecture by trial and error.
Its core component is a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point.
Methodologically, Adan reformulates vanilla Nesterov acceleration to obtain NME, then uses NME to estimate the first- and second-order moments of the gradient inside an adaptive gradient algorithm.
Further study is needed to establish how far these efficiency gains carry over to the wide range of deep learning architectures.
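
The key points above describe the idea but not the update rule, so the following Python sketch is only an illustration of how such a step might look. It assumes the commonly cited form of the Adan update, in which the difference between consecutive gradients stands in for the gradient at the extrapolation point. The function names (adan_step, run_demo), the hyperparameter values, the zero initialization of the moment estimates, and the toy quadratic objective are all assumptions made for this example, not details given in this summary. Weight decay, bias correction, and restarts are left out.

```python
import math


def adan_step(theta, grad, prev_grad, state, lr=0.05,
              beta1=0.02, beta2=0.08, beta3=0.01, eps=1e-8):
    """One Adan-style step (illustrative sketch, not a reference implementation)."""
    new_theta = []
    for i, (t, g, pg) in enumerate(zip(theta, grad, prev_grad)):
        # The gradient difference replaces the gradient at the extrapolation
        # point, so each step needs only one gradient evaluation.
        diff = g - pg
        # Moving average of gradients (first moment).
        state["m"][i] = (1 - beta1) * state["m"][i] + beta1 * g
        # Moving average of gradient differences.
        state["v"][i] = (1 - beta2) * state["v"][i] + beta2 * diff
        # Second moment of the difference-corrected gradient.
        corrected = g + (1 - beta2) * diff
        state["n"][i] = (1 - beta3) * state["n"][i] + beta3 * corrected ** 2
        # Per-coordinate adaptive step size.
        step = lr / (math.sqrt(state["n"][i]) + eps)
        new_theta.append(t - step * (state["m"][i] + (1 - beta2) * state["v"][i]))
    return new_theta


def run_demo(steps=500):
    """Minimize the toy objective f(theta) = 0.5 * sum(theta_i^2)."""
    theta = [3.0, -2.0]
    state = {name: [0.0] * len(theta) for name in ("m", "v", "n")}
    prev_grad = None
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sum(t * t for t in theta))
        grad = list(theta)  # the gradient of this objective is theta itself
        if prev_grad is None:
            prev_grad = grad  # assumption: zero gradient difference on step one
        theta = adan_step(theta, grad, prev_grad, state)
        prev_grad = grad
    losses.append(0.5 * sum(t * t for t in theta))
    return theta, losses


if __name__ == "__main__":
    final_theta, history = run_demo()
    print("initial loss:", history[0])
    print("final loss:", history[-1])
```

Note that the loop evaluates the gradient once per iteration, at the current parameters only. A textbook Nesterov step would instead need a gradient at a look-ahead point, which is the overhead NME is described as avoiding.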

Abstract

In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. In addition, we prove that Adan finds an ϵ-approximate first-order stationary point within O(ϵ
