The deployment of transformer-based language models on resource-constrained edge devices presents fundamental challenges in computational efficiency and memory utilization. We introduce SQ-LoRA (Stable-rank Quantized Low-Rank Adaptation), a theoretically grounded compression framework that achieves unprecedented efficiency through the synergistic integration of adaptive low-rank decomposition, hardware-accelerated structured sparsity, and intelligent hybrid quantization. Our primary contribution establishes the first rigorous mathematical connection between the matrix stable rank and optimal LoRA rank selection, formalized in Theorem I, which provides bounded approximation guarantees. SQ-LoRA implements: (1) adaptive rank allocation via stable-rank analysis to automatically determine layer-wise compression ratios; (2) 4:8 structured sparsity patterns, enabling 2× hardware acceleration on modern edge processors; and (3) a three-tier quantization scheme that combines 4-bit NormalFloat storage with selective 3-bit/8-bit precision to preserve outliers. A comprehensive evaluation on four diverse natural language processing (NLP) benchmarks demonstrates that SQ-LoRA achieves a 320 MB memory footprint (96.7% reduction) and a 10 ms inference latency (91.7% improvement), and maintains 82.0% average accuracy (within 0.15% of the full model). Statistical significance testing (p < 0.001) confirms its superiority over state-of-the-art methods. This framework enables the deployment of sophisticated language models on devices with 2 GB of RAM, advancing practical edge-AI applications.
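As a rough illustration of two quantities named in the abstract, the sketch below computes the stable rank of a weight matrix, defined as the squared Frobenius norm divided by the squared spectral norm, and applies a 4:8 magnitude mask that keeps the four largest-magnitude weights in each group of eight. The function names, the power-iteration estimate of the spectral norm, and the `suggested_lora_rank` rule (rounding the stable rank up) are assumptions made for illustration only. The abstract does not state the rank-selection rule of Theorem I or the pruning criterion used by SQ-LoRA.

```python
import math


def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(x * x for row in W for x in row)


def spectral_norm_sq(W, iters=200):
    """Estimate the largest eigenvalue of W^T W by power iteration.

    Assumption: a plain power iteration is accurate enough for this sketch;
    a real pipeline would more likely use an SVD routine.
    """
    n = len(W[0])
    v = [1.0 / math.sqrt(n)] * n  # unit start vector
    lam = 0.0
    for _ in range(iters):
        Wv = [sum(w * x for w, x in zip(row, v)) for row in W]  # W v
        u = [sum(W[i][j] * Wv[i] for i in range(len(W))) for j in range(n)]  # W^T (W v)
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
        lam = norm  # converges to the top eigenvalue of W^T W
    return lam


def stable_rank(W):
    """Stable rank = ||W||_F^2 / ||W||_2^2 (at most the ordinary rank)."""
    spec = spectral_norm_sq(W)
    return frobenius_sq(W) / spec if spec > 0.0 else 0.0


def suggested_lora_rank(W):
    """Hypothetical heuristic: round the stable rank up, with a floor of 1."""
    return max(1, math.ceil(stable_rank(W)))


def mask_4_of_8(row):
    """Keep the 4 largest-magnitude entries in each group of 8, zero the rest."""
    out = list(row)
    for start in range(0, len(row), 8):
        group = range(start, min(start + 8, len(row)))
        keep = set(sorted(group, key=lambda i: abs(row[i]), reverse=True)[:4])
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out


if __name__ == "__main__":
    W = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(stable_rank(W), suggested_lora_rank(W))
    print(mask_4_of_8([0.1, -0.9, 0.3, 0.8, -0.2, 0.7, 0.05, -0.6]))
```

A layer whose singular values decay quickly has a stable rank far below its dimension, which is the intuition behind giving such layers a smaller adapter rank; the exact mapping used by the authors would have to be taken from the paper itself.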
Bayat Toksöz
Işik
Toksöz et al. studied this question.
www.synapsesocial.com/papers/699d3ff8de8e28729cf64d10 — DOI: https://doi.org/10.3390/app16042113