March 3, 2026 | Open Access

MAC: Multimodal Attentive Contrastive Learning Framework

Key Points

Multimodal attentive contrastive learning improves joint representation for audio and text data.
Empirical results show a 15% improvement in downstream classification performance on low-data tasks.
Extensive evaluations across pre-trained language and audio models confirm the framework's effectiveness.
The results highlight the potential for improved learning in multimodal systems, especially where labeled data is limited.

Abstract

The growing capabilities of large-scale models in text and audio have significantly advanced multimodal learning. However, many downstream tasks still suffer from insufficient labeled data, challenging the effective learning of robust multimodal representations. To address these challenges, we propose the Multimodal Attentive Contrastive (MAC) learning framework, which integrates contrastive learning with a Mixture of Experts (MoE) mechanism to enhance multimodal representation learning for text and audio data. Our approach leverages pre-trained foundation models to generate high-quality unimodal embeddings, which are further refined through unsupervised contrastive learning. This contrastive model aligns multimodal audio-text pairs, improving their joint representation. A novel MoE-based attention mechanism is introduced, wherein modality-specialized expert networks dynamically combine these embeddings based on sample-specific gating weights. This design enhances the model's ability to balance modality contributions, especially in low-data settings. We perform extensive empirical evaluations on multiple pre-trained language and audio models, comparing different contrastive training configurations and validating the effectiveness of MAC through rigorous cross-validation experiments. Empirical results demonstrate that our framework improves downstream classification performance by effectively leveraging contrastive objectives and MoE, outperforming traditional multimodal learning approaches, particularly in low-data scenarios.
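To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE-style loss for the audio-text alignment step and a two-way softmax gate for the MoE combination. The abstract does not specify the exact loss, temperature, or gating architecture, so the function names (`info_nce`, `gated_fusion`), the temperature value, and the use of identity "experts" with externally supplied gate logits are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(xs):
    """Numerically stable log-softmax over a list of scores."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


def info_nce(audio, text, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form, not confirmed by the source).

    Row i of `audio` is treated as the positive pair of row i of `text`;
    every other row in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in audio]
    t = [l2_normalize(v) for v in text]
    n = len(a)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Audio-to-text direction: softmax over each row.
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-audio direction: softmax over each column.
    t2a = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def gated_fusion(audio_emb, text_emb, gate_logits):
    """Combine two modality embeddings with sample-specific gate weights.

    Hypothetical simplification: the "experts" are identity maps and the
    gate logits are passed in, whereas a real model would learn both.
    """
    w_audio, w_text = [math.exp(lp) for lp in log_softmax(gate_logits)]
    return [w_audio * x + w_text * y for x, y in zip(audio_emb, text_emb)]


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text_aligned = [[1.0, 0.0], [0.0, 1.0]]
    text_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss: ", info_nce(audio, text_aligned))
    print("shuffled loss:", info_nce(audio, text_shuffled))
    print("fused:", gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0]))
```

Under these assumptions, correctly paired audio and text embeddings yield a lower loss than mismatched ones, and the gate logits decide how much each modality contributes to the fused vector for a given sample.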
