In this paper, we compare two optimization algorithms used to minimize loss functions in machine learning: gradient descent (GD) and stochastic gradient descent (SGD). GD computes the gradient of the loss over the entire dataset before each parameter update. With a suitably chosen step size this gives stable convergence, but each step is computationally expensive. SGD instead updates the parameters after processing a single randomly chosen data point, which makes each update much cheaper but introduces noise into the gradient estimate. As a result, GD follows a smooth path toward a minimum, while SGD takes a noisy, winding path that sometimes overshoots a local minimum but can also escape one. For large datasets, GD becomes inefficient, whereas SGD scales well and is typically used in deep learning. Both methods aim to find the optimal parameters of a machine learning model; they differ in how they trade stability against efficiency, with GD favoring stable, exact gradient steps and SGD favoring speed.
Salman et al. studied this question.
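The difference between the two update rules can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a mean-squared-error loss on a small synthetic linear regression problem, and the function names (`make_data`, `gd`, `sgd`), learning rates, and step counts are hypothetical choices made only for illustration.

```python
import numpy as np


def make_data(n=200, seed=0):
    """Synthetic linear regression data (an assumption, not from the paper)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    w_true = np.array([2.0, -3.0])
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y, w_true


def gd(X, y, lr=0.1, steps=200):
    """Gradient descent: each update uses the gradient over the full dataset."""
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - y
        grad = (2.0 / n) * (X.T @ residual)  # exact gradient of the MSE loss
        w = w - lr * grad
    return w


def sgd(X, y, lr=0.01, steps=4000, seed=1):
    """Stochastic gradient descent: each update uses one random data point."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(n)  # pick a single sample uniformly at random
        residual_i = X[i] @ w - y[i]
        grad = 2.0 * residual_i * X[i]  # noisy one-sample gradient estimate
        w = w - lr * grad
    return w


if __name__ == "__main__":
    X, y, w_true = make_data()
    print("true weights:", w_true)
    print("GD estimate: ", gd(X, y))
    print("SGD estimate:", sgd(X, y))
```

In this sketch each GD step touches all 200 samples, while each SGD step touches one, so SGD takes many more but far cheaper steps. The smaller learning rate for SGD is a common way to keep the noise of one-sample gradients under control.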