What type of study is this?

This is an experimental study.

October 20, 2025 | Open Access

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Key Points

ZeroQAT is a zeroth-order, forward-only quantization-aware training framework for large language models that supports both weight and activation quantization while avoiding the memory cost of backpropagation.
Experiments show ZeroQAT outperforms representative PTQ and QAT baselines with significantly less memory, enabling fine-tuning of a 13B model at low bit-widths (e.g., 2-4 bits) on a single 8GB GPU.
The framework makes on-device training feasible by estimating gradients from forward passes alone, which removes backpropagation and its computational and memory overhead.
A lightweight variant for quantized fine-tuning freezes and pre-quantizes most parameters to cut memory usage further, targeting resource-limited edge devices.
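
As a rough sanity check on the memory figures quoted in the abstract (a 13B model on a single 8GB GPU), the storage needed for the weights alone can be computed from the parameter count and the bit-width. The sketch below is back-of-the-envelope arithmetic, not the authors' accounting: the helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores quantization scales and zero points, activations, and any optimizer or runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit-width and nothing
    else (scales, activations, buffers) is counted.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 13B parameters at 16-bit, 4-bit, and 2-bit precision.
    for bits in (16, 4, 2):
        gb = weight_memory_gb(13e9, bits)
        print(f"13B weights at {bits:>2}-bit: {gb:.2f} GB")
```

Under these assumptions, 13B parameters take 26 GB at 16 bits, 6.5 GB at 4 bits, and 3.25 GB at 2 bits, so only the low-bit weights fit within an 8GB budget, and the remaining headroom has to cover everything else training needs.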

Abstract

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their inability to fine-tune model parameters and often suffer significant accuracy loss in low-bit scenarios. Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs, limiting its practicality for LLM deployment. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. ZeroQAT leverages forward-only gradient estimation to eliminate backpropagation, substantially reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We further introduce a lightweight variant of ZeroQAT for quantized fine-tuning, which freezes and pre-quantizes most parameters to further cut memory usage. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory. For example, ZeroQAT enables fine-tuning of a 13B model at extremely low bit-widths (e.g., 2-4 bits) on a single 8GB GPU, and even allows fine-tuning a 6.7B model on a OnePlus 12 smartphone, demonstrating its practicality for end-to-end QAT on resource-limited edge devices.
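
To make the idea of forward-only gradient estimation concrete, here is a minimal sketch of one zeroth-order update through a fake-quantized model. It is not the authors' implementation: the two-point (SPSA-style) estimator, the symmetric per-tensor quantizer, the toy linear regression loss, and all names (`fake_quantize`, `loss`, `zo_step`) are assumptions chosen for illustration. Only the weights are quantized here, whereas the abstract states that ZeroQAT also covers activations.

```python
import numpy as np


def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize then dequantize with a symmetric per-tensor scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    if max_abs == 0.0:
        return w.copy()
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def loss(w: np.ndarray, x: np.ndarray, y: np.ndarray, bits: int) -> float:
    """Mean squared error of a linear model evaluated with quantized weights."""
    pred = x @ fake_quantize(w, bits)
    return float(np.mean((pred - y) ** 2))


def zo_step(w, x, y, bits, lr, eps, rng):
    """One zeroth-order update using two forward passes and no backpropagation."""
    z = rng.standard_normal(w.shape)          # random perturbation direction
    loss_plus = loss(w + eps * z, x, y, bits)  # forward pass 1
    loss_minus = loss(w - eps * z, x, y, bits) # forward pass 2
    g = (loss_plus - loss_minus) / (2 * eps)   # scalar directional estimate
    return w - lr * g * z                      # move along the sampled direction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 8))
    y = x @ rng.standard_normal(8)
    w = rng.standard_normal(8)
    print("loss before:", loss(w, x, y, bits=4))
    w = zo_step(w, x, y, bits=4, lr=1e-2, eps=5e-2, rng=rng)
    print("loss after one step:", loss(w, x, y, bits=4))
```

Each step needs only the weights and two scalar loss values, so no activations have to be stored for a backward pass. Because rounding makes the quantized loss discontinuous in the weights, a single two-point estimate can be noisy, and the perturbation size `eps` and learning rate `lr` used here are illustrative values rather than tuned settings.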
