Audio-Visual Speech Recognition (AVSR) has long been studied in the literature. By leveraging complementary information from the acoustic and visual modalities, it offers a promising route to robust speech transcription. While recent AVSR models have achieved impressive performance on large-scale, uniformly distributed datasets, they often overlook the challenges posed by real-world scenarios, where data is collected across multiple sessions and environments, leading to significant domain shifts and heterogeneous distributions. Such heterogeneity can cause catastrophic forgetting and hinder the generalization ability of conventional models. To bridge this gap, we introduce the Continual Audio-Visual Speech Recognition (CL-AVSR) problem, which formulates AVSR as a continual learning task. We establish a dedicated benchmark for CL-AVSR by designing three experimental scenarios that reflect real-world challenges: introducing varying background noise into the audio stream, degrading video quality in the visual stream, and dividing tasks by speaker characteristics so that both modalities are affected jointly. These scenarios systematically evaluate a model's ability to adapt and retain knowledge across dynamic, non-stationary data streams. To address the unique challenges of CL-AVSR, we propose the Interaction-enhanced Multimodal Prompt learning (IMP) framework. IMP builds upon a pre-trained AV-HuBERT backbone and integrates task-relevant soft prompts with cross-modal and cross-task interactions, enabling efficient knowledge transfer from high-quality source domains to typically low-quality target domains with minimal parameter overhead. The interactive prompts facilitate fine-grained alignment and adaptation between modalities and tasks, while contrastive regularization further mitigates catastrophic forgetting.
Furthermore, we devise a multimodal prompt selection strategy based on clustering-based feature analysis, which lets the model dynamically select suitable prompts for unseen data distributions during inference. Extensive experiments on the LRS2 dataset demonstrate that IMP achieves substantial improvements over strong baselines, setting a new state of the art in all CL-AVSR scenarios. Our results highlight the effectiveness of IMP in enhancing continual learning capabilities for AVSR, paving the way for more robust and adaptable multimodal speech recognition systems in real-world applications.
Fu et al. studied this question.