What question did this study set out to answer?

This research aims to investigate Adam’s geometric relationship with natural gradient descent (NGD) across different loss landscapes.

June 1, 2026 · Open Access

How Far is Adam from Natural Gradient Descent?

Key points

Analyzed Adam's full update rule, including momentum, as a diagonal empirical Fisher approximation.
Evaluated Adam's performance across well-conditioned and ill-conditioned linear regression, logistic regression, and a small non-convex neural network.
Used the scale-invariant γ(∆θ) metric to measure geometric deviation from true NGD.
Adam’s geometric deviation is low in well-conditioned settings but rises significantly in ill-conditioned landscapes, reaching misalignments of approximately 103 in the neural network.
Higher geometric drift is associated with slower initial optimization but does not impair final loss minimization; Adam consistently achieves low loss.
The improved empirical Fisher (iEF) provides more stable paths compared to the standard empirical Fisher (EF), which is prone to oscillation or divergence.

Abstract

Adam is the standard optimizer in deep learning, yet its geometric relationship to natural gradient descent (NGD) contains unresolved questions. We study Adam’s full update rule, including momentum, as a diagonal empirical Fisher approximation subject to diagonal truncation, empirical label substitution, and temporal lag. Using the scale-invariant γ(∆θ) metric, we measure Adam’s geometric deviation from true NGD across four loss landscapes: well-conditioned linear regression, ill-conditioned linear regression, logistic regression, and a non-convex small neural network. Adam’s geometric trajectory is context-dependent. Deviation remains low in well-conditioned settings but rises significantly under ill-conditioning, reaching misalignments of ≈ 103 in the neural network. Higher geometric drift correlates with slower initial optimization but does not degrade final objective minimization; Adam consistently reaches low loss. Furthermore, the improved empirical Fisher (iEF) tracks more stable paths than the standard empirical Fisher (EF), which frequently oscillates or diverges. Our results suggest Adam’s practical optimization power may stem from a balance of structural approximation errors and momentum smoothing rather than close tracking of the natural gradient path.
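As a rough illustration of the comparison the abstract describes, the sketch below computes one Adam step and one exact NGD step on an ill-conditioned linear regression problem. It is a minimal sketch under stated assumptions, not the authors' code: the data, the hyperparameters, and the helper names (`adam_step`, `ngd_step`, `cosine`) are all hypothetical. The γ(∆θ) metric is not defined in this summary, so plain cosine similarity is used as a stand-in for directional alignment.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update. Returns (delta_theta, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA (diagonal curvature proxy)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def ngd_step(grad, fisher, lr=1e-3, damping=1e-8):
    """Damped natural gradient step: -lr * (F + damping * I)^{-1} grad."""
    d = grad.shape[0]
    return -lr * np.linalg.solve(fisher + damping * np.eye(d), grad)

def cosine(a, b):
    """Cosine similarity; an assumed stand-in, NOT the paper's gamma metric."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic ill-conditioned regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d)) * np.array([1.0, 1.0, 1.0, 1.0, 30.0])
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
grad = X.T @ (X @ theta - y) / n   # gradient of 0.5 * mean squared error
fisher = X.T @ X / n               # exact Fisher for a unit-variance Gaussian likelihood

m, v = np.zeros(d), np.zeros(d)
delta_adam, m, v = adam_step(grad, m, v, t=1)
delta_ngd = ngd_step(grad, fisher)
alignment = cosine(delta_adam, delta_ngd)
print(f"cosine(Adam step, NGD step) = {alignment:.3f}")
```

At the first step Adam's bias-corrected update reduces to roughly `-lr * sign(grad)`, which ignores the off-diagonal curvature that the NGD step uses. Tracking a quantity like `alignment` over many steps would be one way to observe the kind of drift the abstract reports, although the paper's own measure may behave differently.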
