What question did this study set out to answer?

The study aims to understand how large multimodal models perform under continual instruction tuning and how susceptible they are to forgetting previously learned tasks.

March 15, 2026

Continual Instruction Tuning for Large Multimodal Models

Key Points

Established a benchmark for continual instruction tuning of large multimodal models.
Investigated the impact of continual learning methods on these models.
Explored task-similarity dynamics to improve training strategies.
Catastrophic forgetting was observed in large multimodal models during continual instruction tuning.
Certain continual learning methods were effective in improving performance in various scenarios.
Proposed strategies leveraging task similarity helped enhance model outcomes.

Abstract

Instruction tuning has become a widely adopted approach for aligning large multimodal models (LMMs) with human intent. It enables multi-task joint training through unified data formats. However, as new vision-language tasks constantly emerge, exhaustive joint training on all tasks becomes impractical. Continual learning offers a more flexible and resource-efficient alternative, enabling incremental training of LMMs on emerging tasks. This study investigates two fundamental questions that arise when applying continual learning to instruction tuning of LMMs: 1) Do LMMs suffer from catastrophic forgetting during continual instruction tuning? 2) Can existing continual learning methods be effectively applied to continual instruction tuning of LMMs? We conduct a comprehensive study to answer these questions. First, we establish the first benchmark for continual instruction tuning of LMMs and reveal the phenomenon of catastrophic forgetting in this setup. Second, we integrate and adapt traditional continual learning approaches to this setting and show that they are effective to varying degrees across scenarios. Third, we explore task-similarity dynamics between pairs of vision-language tasks and propose task-similarity-informed regularization and model expansion methods. Experimental results show that our approach consistently boosts the model's performance.
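
To make the notion of catastrophic forgetting concrete, the sketch below computes an average-forgetting score of the kind commonly used in the continual learning literature. It is an illustration only: the abstract does not state which metric the benchmark uses, and the function name average_forgetting, the accuracy-matrix layout, and the numbers are all hypothetical.

```python
from typing import Sequence


def average_forgetting(acc: Sequence[Sequence[float]]) -> float:
    """Average forgetting over all tasks except the last one.

    acc[i][j] is the score on task j measured after training on task i
    (tasks are learned in order 0, 1, ..., n-1). This layout is an
    assumption made for the example, not taken from the paper.
    """
    n = len(acc)
    if n < 2:
        # With a single task there is nothing that could be forgotten.
        return 0.0

    final = acc[-1]
    drops = []
    for j in range(n - 1):
        # Best score on task j at any stage from when it was learned
        # up to (but not including) the final stage.
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Hypothetical scores for three tasks learned sequentially.
    acc = [
        [0.80, 0.00, 0.00],
        [0.60, 0.70, 0.00],
        [0.50, 0.65, 0.75],
    ]
    print(f"average forgetting: {average_forgetting(acc):.3f}")
```

A larger value means the model lost more of its earlier performance after learning later tasks; a value near zero (or negative) indicates little forgetting or even backward transfer.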
