Over the past decade, Deep Learning has fundamentally reshaped Machine Learning and achieved unprecedented breakthroughs across diverse domains such as computer vision, natural language processing, and scientific discovery. This transformation stems from exponential growth in model size and computational complexity, with training compute for notable state-of-the-art models estimated to double roughly every five months. However, this pursuit of scale comes at substantial environmental and financial cost, which makes improved neural network efficiency a central priority across industry and academia. Among the diverse strategies for achieving greater efficiency, pruning, the systematic removal of seemingly least critical parameters, is particularly effective because it reduces memory and computational demands by introducing sparsity into weight tensors. Existing methods range from sophisticated techniques to simple approaches like Iterative Magnitude Pruning (IMP, Han et al., 2015), which, given a pretrained model, removes a fraction of the smallest-magnitude weights and retrains the remaining ones to recover performance, repeating such prune-retrain cycles until a desired level of compression is reached.
This thesis contributes to the ongoing pursuit of efficient Deep Learning with a specific focus on the efficiency and effectiveness of prune-retrain pipelines as exemplified by IMP. The four core publications presented aim to advance state-of-the-art neural network pruning through principled algorithmic innovation across different pruning paradigms, and systematically challenge prevailing narratives about the limitations of simple approaches like IMP. First, "How I Learned To Stop Worrying And Love Retraining" demonstrates that retraining can be drastically accelerated through proper learning rate schedule design. Contrary to the belief that methods which avoid retraining by inducing biases towards sparsity are superior in efficiency and solution quality, IMP with appropriate pruning-adaptive scheduling outperforms significantly more complex approaches at lower computational cost. Second, "Compression-Aware Training of Neural Networks using Frank-Wolfe" shows that training neural networks with the Stochastic Frank-Wolfe algorithm over a versatile family of structured feasible regions induces amenability to pruning and low-rank decomposition across a wide range of sparsity levels. The proposed approach achieves both compression robustness and state-of-the-art dense model performance. Third, "Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging" proposes a method for obtaining multiple sparse models that can be averaged without zeros cancelling out and thereby reducing the resulting model's sparsity. Our approach significantly improves upon IMP and, in particular, enables parallelization of the otherwise sequential retraining process: retraining multiple copies of the pruned model in parallel for short durations and then averaging the copies yields a sparse model that performs on par with or better than a single model retrained for a longer duration. Lastly, "PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs" demonstrates that the retraining of pruned Large Language Models (LLMs), typically considered infeasible due to their enormous scale, can be efficiently accomplished by optimizing as few as 1% of the parameters. This approach drastically reduces computational and memory requirements and enables pruning and retraining of 30-billion-parameter models on a single GPU within minutes. Overall, these contributions provide novel insights into the efficiency and effectiveness of state-of-the-art pruning algorithms and demonstrate that simple approaches, when properly designed and executed, can be highly effective and efficient.
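For readers unfamiliar with the prune-retrain cycle, the following is a minimal illustrative sketch, not the thesis implementation. It treats a model as a flat list of weights, prunes globally by magnitude, and uses a placeholder `retrain` callback in place of real training. All function names, the global (rather than layer-wise) pruning criterion, and the idea that the averaged copies share one pruning mask are assumptions made for this example.

```python
def magnitude_mask(weights, mask, fraction):
    """Zero out `fraction` of the still-active weights with the smallest magnitude."""
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(alive) * fraction)
    # Sort surviving indices by absolute value, smallest first.
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
    return new_mask


def apply_mask(weights, mask):
    """Set every masked-out weight to zero."""
    return [w * m for w, m in zip(weights, mask)]


def imp(weights, retrain, fraction=0.5, cycles=2):
    """Hypothetical IMP loop: prune by magnitude, retrain, repeat."""
    mask = [1] * len(weights)
    for _ in range(cycles):
        mask = magnitude_mask(weights, mask, fraction)
        weights = apply_mask(weights, mask)
        # `retrain` is a stand-in for real training; re-applying the mask
        # keeps pruned weights at zero afterwards.
        weights = apply_mask(retrain(weights), mask)
    return weights, mask


def average(models):
    """Element-wise average of several weight lists."""
    return [sum(ws) / len(ws) for ws in zip(*models)]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


if __name__ == "__main__":
    dense = [0.1, -2.0, 0.3, 4.0, -0.05, 1.5, -0.7, 0.9]
    # Identity "retraining" keeps the example deterministic.
    pruned, mask = imp(dense, retrain=lambda ws: list(ws))
    # Two copies retrained differently but under the same mask (assumption).
    copy_a = apply_mask([w + 1.0 for w in pruned], mask)
    copy_b = apply_mask([w - 1.0 for w in pruned], mask)
    soup = average([copy_a, copy_b])
    print(mask, sparsity(pruned), sparsity(soup))
```

Under the shared-mask assumption, the averaged model is exactly as sparse as each copy. Averaging models whose zeros sit in different positions would instead fill in some of those zeros and lower the sparsity.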
Max Lennart Zimmer (Thu,) studied this question.