Fine-tuning pre-trained large language models (LLMs) on limited hardware is challenging because of GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate these constraints. However, it remains unclear which method achieves the fastest fine-tuning while avoiding GPU out-of-memory errors in a given environment. To address this challenge, we introduce LLMem, a solution that estimates GPU memory consumption when distributed fine-tuning methods are applied across multiple GPUs and identifies the optimal method. We estimate GPU memory usage prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
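To make the idea of estimating memory before fine-tuning concrete, the following is a minimal back-of-envelope sketch. It is not LLMem's estimator: it covers only the static per-GPU footprint (weights, gradients, and Adam optimizer states) under the assumption of full fp32 training, and it ignores activations, temporary buffers, and framework overhead, which a peak-memory estimate would also need to account for. The names `estimate_static_memory`, `StaticMemoryEstimate`, and `relative_error` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass

BYTES_FP32 = 4          # assumption: weights, gradients, and optimizer states in fp32
GIB = 1024 ** 3


@dataclass
class StaticMemoryEstimate:
    """Per-GPU static memory in GiB (hypothetical helper type)."""
    params_gib: float
    grads_gib: float
    optimizer_gib: float

    @property
    def total_gib(self) -> float:
        return self.params_gib + self.grads_gib + self.optimizer_gib


def estimate_static_memory(num_params: int, num_gpus: int = 1,
                           shard: bool = False) -> StaticMemoryEstimate:
    """Rough per-GPU footprint of weights, gradients, and Adam states.

    With shard=False every GPU holds a full replica (plain data parallelism).
    With shard=True all three components are split evenly across GPUs
    (an idealized fully sharded setup). Activations are NOT included.
    """
    if num_params < 0 or num_gpus < 1:
        raise ValueError("num_params must be >= 0 and num_gpus must be >= 1")
    divisor = num_gpus if shard else 1
    params = num_params * BYTES_FP32 / divisor / GIB
    grads = num_params * BYTES_FP32 / divisor / GIB
    # Adam keeps two extra tensors per parameter (first and second moments).
    optimizer = 2 * num_params * BYTES_FP32 / divisor / GIB
    return StaticMemoryEstimate(params, grads, optimizer)


def relative_error(estimated: float, measured: float) -> float:
    """Relative error of an estimate against a measured peak (same units)."""
    if measured <= 0:
        raise ValueError("measured must be positive")
    return abs(estimated - measured) / measured


if __name__ == "__main__":
    n = 1024 ** 3  # about 1.07 billion parameters
    print(estimate_static_memory(n).total_gib)                         # replicated
    print(estimate_static_memory(n, num_gpus=4, shard=True).total_gib)  # sharded
```

Under these assumptions each parameter costs 16 bytes of static memory, so sharding across GPUs reduces the per-GPU share roughly in proportion to the GPU count. The `relative_error` helper shows one common way to compare an estimate with a measured peak; whether it matches the exact error definition used in the paper is an assumption.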
Taeho Kim
Yanming Wang
Vatshank Chaturvedi
University of Colorado Boulder
University of Colorado System
Electronics and Telecommunications Research Institute
Kim et al. studied this question.
www.synapsesocial.com/papers/68e5ee87b6db643587582f15 — DOI: https://doi.org/10.24963/ijcai.2024/699