June 4, 2024 | Open Access

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Abstract

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as measured by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\ell/2}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned (Ben Arous et al., 2021) and $d$ is the input dimension. However, larger batch sizes, $n_b \gg d^{\ell/2}$, are detrimental to the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
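To make the setting concrete, the following is a minimal sketch of one-pass mini-batch SGD with a correlation-type loss, not the paper's actual implementation. It makes several simplifying assumptions that are not in the abstract: a single-index target $y = \mathrm{He}_\ell(\langle w^\star, x \rangle)$ with a Hermite link (whose information exponent is $\ell$), a single student neuron using the same link, weights constrained to the unit sphere, and a loss $-y\,\sigma(\langle w, x \rangle)$ that keeps only the cross term of the squared loss. All function and parameter names are hypothetical.

```python
import numpy as np
from numpy.polynomial import hermite_e as H


def one_pass_correlation_sgd(d=50, ell=1, n_b=64, steps=200, lr=0.1, seed=0):
    """Illustrative one-pass SGD on a single-index Hermite target.

    Every step draws a fresh batch of n_b samples (one-pass), so the total
    number of samples consumed is steps * n_b.
    Returns (overlap with the hidden direction, samples consumed).
    """
    rng = np.random.default_rng(seed)

    # Coefficients selecting He_ell in the probabilists' Hermite basis,
    # and the coefficients of its derivative (ell * He_{ell-1}).
    coef = np.zeros(ell + 1)
    coef[ell] = 1.0
    dcoef = H.hermeder(coef)

    # Hidden direction (assumption: first basis vector) and random unit init.
    w_star = np.zeros(d)
    w_star[0] = 1.0
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)

    samples_used = 0
    for _ in range(steps):
        X = rng.standard_normal((n_b, d))      # fresh isotropic Gaussian batch
        y = H.hermeval(X @ w_star, coef)       # noiseless labels
        pre = X @ w                            # student pre-activations

        # Correlation loss: L = -mean(y * sigma(pre)); the sigma(pre)^2
        # self-interaction term of the squared loss is dropped.
        grad = -(X.T @ (y * H.hermeval(pre, dcoef))) / n_b

        # Spherical gradient step: remove the radial part, step, renormalize.
        grad -= (grad @ w) * w
        w = w - lr * grad
        w /= np.linalg.norm(w)

        samples_used += n_b

    return float(w @ w_star), samples_used


if __name__ == "__main__":
    overlap, n = one_pass_correlation_sgd()
    print(f"overlap with w_star: {overlap:.3f}, samples used: {n}")
```

In this sketch the trade-off discussed in the abstract corresponds to varying `n_b` and `steps` while holding their product (the sample budget) fixed; how far `steps` can be reduced this way is what the paper's analysis addresses.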
