Instruction tuning has become a widely adopted approach for aligning large multimodal models (LMMs) with human intent. It enables multi-task joint training through unified data formats. However, as new vision-language tasks constantly emerge, exhaustive joint training on all tasks becomes impractical. Continual learning offers a more flexible and resource-efficient alternative, enabling incremental training of LMMs on emerging tasks. This study investigates two fundamental questions that arise when applying continual learning to instruction tuning of LMMs: 1) Do LMMs suffer from catastrophic forgetting during continual instruction tuning? 2) Can existing continual learning methods be effectively applied to continual instruction tuning of LMMs? We conduct a comprehensive study to answer these questions. First, we establish the first benchmark for continual instruction tuning of LMMs and reveal the phenomenon of catastrophic forgetting in this setup. Second, we integrate and adapt traditional continual learning approaches to this setting, showing that these strategies are effective to varying degrees across scenarios. Third, we explore task-similarity dynamics between pairs of vision-language tasks and propose task-similarity-informed regularization and model expansion methods. Experimental results show that our approach consistently boosts the model's performance.
He et al. studied this question.