Adam is the standard optimizer in deep learning, yet its geometric relationship to natural gradient descent (NGD) contains unresolved questions. We study Adam’s full update rule, including momentum, as a diagonal empirical Fisher approximation subject to diagonal truncation, empirical label substitution, and temporal lag. Using the scale-invariant γ(∆θ) metric, we measure Adam’s geometric deviation from true NGD across four loss landscapes: well-conditioned linear regression, ill-conditioned linear regression, logistic regression, and a non-convex small neural network. Adam’s geometric trajectory is context-dependent. Deviation remains low in well-conditioned settings but rises significantly under ill-conditioning, reaching misalignments of ≈ 10³ in the neural network. Higher geometric drift correlates with slower initial optimization but does not degrade final objective minimization; Adam consistently reaches low loss. Furthermore, the improved empirical Fisher (iEF) tracks more stable paths than the standard empirical Fisher (EF), which frequently oscillates or diverges. Our results suggest Adam’s practical optimization power may stem from a balance of structural approximation errors and momentum smoothing rather than close tracking of the natural gradient path.
Vihaan Paka-Hegde (Tue,) studied this question.
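
The abstract's framing of Adam as a diagonal empirical Fisher approximation can be made concrete with a small sketch. The NumPy code below is illustrative only and is not the paper's experimental code. It builds a synthetic linear regression problem and compares, at a single point, the natural gradient direction with an Adam-like direction preconditioned by the square root of the diagonal empirical Fisher. Several things here are assumptions: the function names (make_regression, per_example_grads, directions, cosine) are hypothetical, the model is taken to be Gaussian with unit noise variance so that the exact Fisher is XᵀX/n, and plain cosine similarity stands in for the γ(∆θ) metric, whose definition is not given in the abstract. The sketch also omits momentum and temporal lag, and it averages squared per-example gradients, whereas Adam itself keeps a running average of squared minibatch gradients.

```python
import numpy as np


def make_regression(n, cond, rng, d=5):
    """Synthetic linear regression whose feature scales give a Hessian
    condition number of roughly `cond` (hypothetical helper)."""
    scales = np.logspace(0.0, np.log10(cond) / 2.0, d)
    X = rng.normal(size=(n, d)) * scales
    theta_true = rng.normal(size=d)
    y = X @ theta_true + 0.1 * rng.normal(size=n)
    return X, y


def per_example_grads(theta, X, y):
    """Gradients of 0.5 * (x_i . theta - y_i)^2, one row per example."""
    residual = X @ theta - y
    return residual[:, None] * X


def directions(theta, X, y, eps=1e-8):
    """Return the mean gradient, the natural gradient direction, and an
    Adam-like direction built from the diagonal empirical Fisher."""
    G = per_example_grads(theta, X, y)
    g = G.mean(axis=0)
    # Exact Fisher under the assumed unit-variance Gaussian model.
    fisher = X.T @ X / len(X)
    natural = np.linalg.solve(fisher, g)
    # Diagonal empirical Fisher: mean of squared per-example gradients.
    ef_diag = (G ** 2).mean(axis=0)
    # Adam-style preconditioning divides by the square root of that diagonal.
    adam_like = g / (np.sqrt(ef_diag) + eps)
    return g, natural, adam_like


def cosine(a, b):
    """Cosine similarity, used here only as a stand-in alignment measure."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
for cond in (1.0, 1e4):
    X, y = make_regression(512, cond, rng)
    g, natural, adam_like = directions(np.zeros(X.shape[1]), X, y)
    print(f"cond={cond:g}  cosine(natural, adam_like)={cosine(natural, adam_like):.3f}")
```

Both directions are descent directions by construction, since each applies a positive-definite preconditioner to the same gradient. How closely they align in a given run depends on the data and the conditioning, so the printed values are best read as a qualitative probe, not as a reproduction of the reported results.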