Neural networks, like many technologies, are developing rapidly. Since the advent of deep learning, models have grown deeper and more complex every year, and their demands have outpaced the available hardware. This article discusses modern methods for optimizing neural networks, with an emphasis on post-training quantization (PTQ) as the most practical approach for deploying models under limited computing resources. It presents an overview of the key methods, including pruning, quantization, and knowledge distillation, and compares their effectiveness and applicability. Special attention is paid to the advantages and limitations of PTQ, such as model size reduction, faster inference, and compatibility with industrial frameworks. The experimental part presents quantization results for the MobileNetV2, BERT-base, YOLOv5s, EfficientNet-B0, and DistilBERT models and analyzes the effect of quantization on their accuracy, speed, and compactness. The results show that PTQ performs well: it reduced model size by a factor of 3 and accelerated inference by up to 40%, while losing no more than 1.5% accuracy. These results can serve as a basis for further research on neural network optimization, on combining quantization with other methods, and on creating hybrid methods that retain the advantages of PTQ while offsetting its disadvantages. PTQ is especially effective for mobile and IoT devices, where energy consumption and memory requirements are critical, and its use in computer vision and natural language processing tasks already demonstrates its applicability and promise.
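As a rough illustration of what quantizing an already trained model involves, the sketch below maps a list of floating-point weights to 8-bit integer codes with a single shared scale and then restores approximate values. It is a minimal, assumed example of symmetric per-tensor quantization, not the procedure used in the article; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integer codes in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights standing in for one layer of a trained model.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each code fits in one byte instead of the four bytes of a 32-bit float, which is where the storage saving for quantized tensors comes from; the rounding error is the source of the accuracy loss that has to be measured on real models.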
Tatarnikova et al. studied this question.