What question did this study set out to answer?

This research aims to improve the efficiency and effectiveness of neural network pruning through advanced prune-retrain techniques.

June 13, 2026 | Open Access

Effective and efficient prune-retrain pipelines for neural network compression

Key Points

Focused on iterative magnitude pruning (IMP) as the primary compression method.
Accelerated retraining through proper learning rate schedule design.
Used the Stochastic Frank-Wolfe algorithm to train networks that remain amenable to pruning and low-rank decomposition across a wide range of sparsity levels.
Showed that IMP with pruning-adaptive scheduling outperforms significantly more complex methods at lower computational cost.
Achieved both compression robustness and state-of-the-art dense model performance.
Enabled efficient retraining of pruned Large Language Models by optimizing as few as 1% of parameters, making it feasible to prune and retrain 30-billion-parameter models on a single GPU within minutes.
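
The prune-retrain cycle behind IMP can be sketched in a few lines. The following is a minimal illustration and not the thesis's implementation: a linear least-squares model stands in for a neural network, and the names `magnitude_mask`, `retrain`, and `imp`, the learning rate, and the epoch count are all assumptions made for this example.

```python
import random

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = round(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first; already-zero weights come first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def retrain(weights, mask, data, lr=0.05, epochs=50):
    """Toy retraining (assumption): SGD on squared error, updating only unpruned weights."""
    w = [wi * mi for wi, mi in zip(weights, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Multiplying by the mask keeps pruned weights at zero.
            w = [(wi - lr * err * xi) * mi for wi, xi, mi in zip(w, x, mask)]
    return w

def imp(weights, data, fraction, cycles):
    """Each cycle removes `fraction` of the remaining weights, then retrains the rest."""
    w = list(weights)
    for cycle in range(1, cycles + 1):
        sparsity = 1.0 - (1.0 - fraction) ** cycle  # cumulative sparsity after this cycle
        mask = magnitude_mask(w, sparsity)
        w = retrain(w, mask, data)
    return w

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data, made up purely for illustration.
    data = [([rng.uniform(-1, 1) for _ in range(8)], rng.uniform(-1, 1)) for _ in range(20)]
    dense = [rng.uniform(-1, 1) for _ in range(8)]
    print(imp(dense, data, fraction=0.5, cycles=2))
```

With `fraction=0.5` and two cycles, the sketch prunes half of the weights, retrains, then prunes half of the survivors and retrains again, ending at 75% sparsity. The constant learning rate used here is only a placeholder and does not reflect the schedules studied in the thesis.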

Abstract

Over the past decade, Deep Learning has fundamentally reshaped Machine Learning and achieved unprecedented breakthroughs across diverse domains such as computer vision, natural language processing, and scientific discovery. This transformation stems from exponential growth in model size and computational complexity, with training compute for notable state-of-the-art models estimated to double roughly every five months. However, this pursuit of scale comes at substantial environmental and financial costs, which makes improved neural network efficiency a central priority across industry and academia. Among the diverse strategies to achieve greater efficiency, pruning, the systematic removal of seemingly least critical parameters, is particularly effective because it reduces memory and computational demands by introducing sparsity into weight tensors. Existing methods range from sophisticated techniques to simple approaches like Iterative Magnitude Pruning (IMP; Han et al., 2015). Given a pretrained model, IMP removes a fraction of the smallest-magnitude weights and retrains the remaining ones to recover performance, repeating such prune-retrain cycles until a desired level of compression is reached.

This thesis contributes to the ongoing pursuit of efficient Deep Learning with a specific focus on the efficiency and effectiveness of prune-retrain pipelines as exemplified by IMP. The four core publications presented aim to advance state-of-the-art neural network pruning through principled algorithmic innovation across different pruning paradigms, and systematically challenge prevailing narratives about the limitations of simple approaches like IMP. First, "How I Learned To Stop Worrying And Love Retraining" demonstrates that retraining can be drastically accelerated through proper learning rate schedule design. Contrary to the belief that methods which avoid retraining by inducing biases towards sparsity are superior in efficiency and solution quality, IMP with appropriate pruning-adaptive scheduling outperforms significantly more complex approaches at lower computational cost. Second, "Compression-Aware Training of Neural Networks using Frank-Wolfe" shows that training neural networks with the Stochastic Frank-Wolfe algorithm over a versatile family of structured feasible regions induces amenability to pruning and low-rank decomposition across a wide range of sparsity levels. The proposed approach achieves both compression robustness and state-of-the-art dense model performance. Third, "Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging" proposes a method for obtaining multiple sparse models that can be averaged without the averaging cancelling out zeros and thereby reducing the resulting model's sparsity. Our approach significantly improves upon IMP and, in particular, enables parallelization of the otherwise sequential retraining process: retraining multiple copies of the pruned model in parallel for short durations and then averaging the copies yields a sparse model that performs on par with or better than a single model retrained for a longer duration. Lastly, "PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs" demonstrates that the retraining of pruned Large Language Models (LLMs), typically considered infeasible due to their enormous scale, can be efficiently accomplished by optimizing as few as 1% of parameters. This approach drastically reduces computational and memory requirements and enables pruning and retraining of 30-billion-parameter models on a single GPU within minutes. Overall, these contributions provide novel insights into the efficiency and effectiveness of state-of-the-art pruning algorithms and demonstrate that simple approaches, when properly designed and executed, can be highly effective and efficient.
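
The averaging argument behind sparse model soups can be checked with a small example. This is a hedged sketch, not code from the thesis: the weight vectors are made-up values, and `average` and `sparsity` are hypothetical helper names. It assumes that copies retrained from the same pruned model keep that model's zero pattern, whereas independently pruned models generally do not share one.

```python
def average(models):
    """Element-wise mean of several weight vectors of equal length."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)

# Two copies retrained from the same pruned model: identical zero pattern (illustrative values).
shared_a = [0.0, 1.2, 0.0, -0.7]
shared_b = [0.0, 0.9, 0.0, -1.1]

# A model pruned independently: its zeros sit in different positions (illustrative values).
other = [0.5, 0.0, 0.0, -0.8]

soup = average([shared_a, shared_b])   # zeros line up, so they survive averaging
mixed = average([shared_a, other])     # mismatched zeros are filled in by the other model

print(sparsity(shared_a), sparsity(soup), sparsity(mixed))
```

Averaging the two copies with a shared zero pattern leaves the sparsity unchanged, while averaging models with different zero patterns keeps a zero only where both models have one.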
