What question did this study set out to answer?

The study aims to improve the computational efficiency and memory utilization of language models on edge devices through a new compression framework.

February 24, 2026 | Open Access

SQ-LoRA: Memory-Efficient Language Model Compression Through Stable-Rank-Guided Quantization for Edge Computing Applications

Key Points

Introduced SQ-LoRA as a compression framework combining low-rank adaptation with quantization and sparsity.
Implemented adaptive rank allocation using stable-rank analysis for layer-wise compression ratios.
Used 4:8 structured sparsity patterns, enabling 2× hardware acceleration on modern edge processors.
Developed a three-tier quantization scheme that combines 4-bit NormalFloat storage with selective 3-bit/8-bit precision to preserve outliers.
Achieved a 320 MB memory footprint (96.7% reduction).
Reduced inference latency to 10 ms (91.7% improvement).
Maintained 82.0% average accuracy across four NLP benchmarks, within 0.15% of the full model.
Statistical significance testing (p < 0.001) confirms superiority over state-of-the-art methods.
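
The percentages quoted for memory and latency imply baseline figures that this page does not state. The short calculation below recovers them under one assumption: each percentage is measured relative to the uncompressed baseline, that is, reduction = (baseline - compressed) / baseline. The helper name `implied_baseline` is hypothetical and used only for illustration; the derived baselines are estimates, not values reported by the authors.

```python
def implied_baseline(compressed: float, reduction_pct: float) -> float:
    """Return the baseline implied by a compressed value and a percentage reduction.

    Assumes reduction_pct = 100 * (baseline - compressed) / baseline.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return compressed / (1 - reduction_pct / 100)


# Figures quoted on this page; the baselines below are derived, not reported.
memory_baseline_mb = implied_baseline(320, 96.7)   # roughly 9.7 GB
latency_baseline_ms = implied_baseline(10, 91.7)   # roughly 120 ms

print(f"implied baseline memory:  {memory_baseline_mb:.0f} MB")
print(f"implied baseline latency: {latency_baseline_ms:.1f} ms")
```

If the authors measured the reductions against a different reference point, these estimates would not apply.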

Abstract

The deployment of transformer-based language models on resource-constrained edge devices presents fundamental challenges in computational efficiency and memory utilization. We introduce SQ-LoRA (Stable-rank Quantized Low-Rank Adaptation), a theoretically grounded compression framework that achieves unprecedented efficiency through the synergistic integration of adaptive low-rank decomposition, hardware-accelerated structured sparsity, and intelligent hybrid quantization. Our primary contribution establishes the first rigorous mathematical connection between the matrix stable rank and optimal LoRA rank selection, formalized in Theorem I, which provides bounded approximation guarantees. SQ-LoRA implements: (1) adaptive rank allocation via stable-rank analysis to automatically determine layer-wise compression ratios; (2) 4:8 structured sparsity patterns, enabling 2× hardware acceleration on modern edge processors; and (3) a three-tier quantization scheme that combines 4-bit NormalFloat storage with selective 3-bit/8-bit precision to preserve outliers. A comprehensive evaluation on four diverse natural language processing (NLP) benchmarks demonstrates that SQ-LoRA achieves a 320 MB memory footprint (96.7% reduction) and a 10 ms inference latency (91.7% improvement), and maintains 82.0% average accuracy (within 0.15% of the full model). Statistical significance testing (p < 0.001) confirms its superiority over state-of-the-art methods. This framework enables the deployment of sophisticated language models on devices with 2 GB of RAM, advancing practical edge-AI applications.
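
Two of the ingredients named in the abstract can be illustrated with a minimal NumPy sketch. The stable rank of a matrix W is the standard quantity ||W||_F^2 / ||W||_2^2, which lies between 1 and rank(W). The mapping from stable rank to a LoRA rank shown here (`pick_lora_rank`, with its `scale` and `max_rank` parameters) is a hypothetical placeholder, since the abstract does not state the rule established by Theorem I. The 4:8 mask assumes the usual N:M convention (keep 4 of every 8 consecutive weights) and assumes magnitude-based selection, which the abstract also does not specify. All function names are illustrative.

```python
import numpy as np


def stable_rank(weight: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2; always between 1 and rank(W)."""
    fro_sq = np.linalg.norm(weight, "fro") ** 2
    spec_sq = np.linalg.norm(weight, 2) ** 2  # largest singular value, squared
    return float(fro_sq / spec_sq)


def pick_lora_rank(weight: np.ndarray, scale: float = 1.0, max_rank: int = 64) -> int:
    """HYPOTHETICAL rule: round scale * stable_rank up, then clamp to [1, max_rank].

    The paper's Theorem I defines the actual rule; this is only a placeholder.
    """
    return int(min(max_rank, max(1, np.ceil(scale * stable_rank(weight)))))


def mask_4_of_8(weight: np.ndarray) -> np.ndarray:
    """Boolean mask keeping the 4 largest-magnitude entries in each group of 8.

    Groups are consecutive entries along the last axis, whose length must be
    a multiple of 8. Magnitude-based selection is an assumption.
    """
    if weight.shape[-1] % 8 != 0:
        raise ValueError("last dimension must be a multiple of 8")
    groups = np.abs(weight).reshape(-1, 8)
    # Column indices of the 4 largest magnitudes within every group of 8.
    keep = np.argsort(groups, axis=1)[:, 4:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(weight.shape)


# Usage on a random stand-in for one layer's weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
print("stable rank:", stable_rank(W))
print("placeholder LoRA rank:", pick_lora_rank(W))
sparse_W = W * mask_4_of_8(W)  # exactly half of the entries are zeroed
```

Because the stable rank depends only on the singular values, it can be computed once per layer before fine-tuning, which is presumably what makes it usable for layer-wise rank allocation.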
