As AI systems are increasingly deployed in resource-constrained environments, efficient and cost-effective methods become critical. Large Language Models (LLMs) impose computational demands that make their deployment impractical for many real-world applications. This study evaluates two parameter-efficient fine-tuning methods, QLoRA and Prompt Tuning, in combination with DistilBERT to address these challenges. Our combined approach achieved a 36.2% reduction in memory usage and a 50% reduction in inference costs while maintaining 87.75% accuracy compared to baseline models. The results indicate that stacking these techniques can provide multiplicative benefits in resource reduction without significant performance degradation, offering a practical option for resource-constrained deployments.
Shakti et al. studied this question.