This paper demonstrates mLoRA, a system for parallel, efficient fine-tuning of Large Language Models (LLMs) with Low-Rank Adaptation (LoRA). mLoRA introduces two core components: LoRAPP, a zero-bubble pipeline parallelism mechanism that exploits the independence of LoRA adapters to maximize utilization across multiple GPUs, and BatchLoRA, a custom operator that consolidates multiple LoRA tasks into batched matrix operations to reduce kernel launch overhead. The system also includes a memory-aware task scheduler for efficient resource allocation. On database-related tasks, including Text2SQL and LLM-based data preprocessing (LLM4DP), mLoRA trains 30–45% faster than existing parallel methods, and it has been deployed in production at AntGroup. This demo paper was submitted to the PVLDB 2025 Demo Track and serves as a companion to the full research paper accepted at VLDB 2025.
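The batching idea behind BatchLoRA can be illustrated with a small NumPy sketch. This is not mLoRA's actual operator. It is a minimal illustration under stated assumptions: all adapters share one frozen base weight `W`, and every task uses the same rank and batch size so the tensors can be stacked without padding. The function names `lora_forward_looped` and `lora_forward_batched` are hypothetical. The sketch shows only that several independent LoRA forward passes can be expressed as one shared base matmul plus two stacked low-rank matmuls, and that the result matches the per-adapter loop.

```python
import numpy as np

def lora_forward_looped(W, xs, As, Bs, scale=1.0):
    """Reference path: process one adapter at a time (many small matmuls)."""
    return [x @ W.T + scale * (x @ A.T) @ B.T for x, A, B in zip(xs, As, Bs)]

def lora_forward_batched(W, xs, As, Bs, scale=1.0):
    """Batched path: one shared base matmul plus two stacked low-rank matmuls.

    Assumes every task has the same batch size and rank (an illustrative
    simplification; a real system would need padding or grouping).
    """
    X = np.stack(xs)   # (T, b, d_in)  inputs of T tasks
    A = np.stack(As)   # (T, r, d_in)  LoRA down-projections
    B = np.stack(Bs)   # (T, d_out, r) LoRA up-projections
    T, b, d_in = X.shape

    # Frozen base weights are shared, so all tasks go through a single matmul.
    base = (X.reshape(T * b, d_in) @ W.T).reshape(T, b, -1)

    # np.matmul treats the leading axis as a batch of independent matrices.
    low = np.matmul(X, A.transpose(0, 2, 1))        # (T, b, r)
    delta = np.matmul(low, B.transpose(0, 2, 1))    # (T, b, d_out)
    return base + scale * delta

# Hypothetical toy sizes: 3 tasks, batch 4, hidden 8 -> 6, rank 2.
rng = np.random.default_rng(0)
T, b, d_in, d_out, r = 3, 4, 8, 6, 2
W = rng.standard_normal((d_out, d_in))
xs = [rng.standard_normal((b, d_in)) for _ in range(T)]
As = [rng.standard_normal((r, d_in)) for _ in range(T)]
Bs = [rng.standard_normal((d_out, r)) for _ in range(T)]

looped = lora_forward_looped(W, xs, As, Bs)
batched = lora_forward_batched(W, xs, As, Bs)
assert np.allclose(np.stack(looped), batched)
```

On a GPU, the looped path would issue a separate set of kernels per adapter, while the batched path issues a fixed number regardless of how many adapters are stacked, which is the kind of launch-overhead saving the text attributes to BatchLoRA.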
Huang et al. (Sun,) studied this question.