The growing capabilities of large-scale text and audio models have significantly advanced multimodal learning. However, many downstream tasks still lack sufficient labeled data, which makes it difficult to learn robust multimodal representations. To address this challenge, we propose the Multimodal Attentive Contrastive (MAC) learning framework, which integrates contrastive learning with a Mixture of Experts (MoE) mechanism to improve multimodal representation learning for text and audio data. Our approach uses pre-trained foundation models to generate high-quality unimodal embeddings, which are then refined through unsupervised contrastive learning that aligns paired audio and text embeddings and improves their joint representation. We introduce a novel MoE-based attention mechanism in which modality-specialized expert networks dynamically combine these embeddings using sample-specific gating weights. This design improves the model's ability to balance modality contributions, especially in low-data settings. We evaluate MAC extensively with multiple pre-trained language and audio models, compare different contrastive training configurations, and validate its effectiveness through rigorous cross-validation experiments. The results show that, by combining contrastive objectives with MoE, our framework improves downstream classification performance and outperforms traditional multimodal learning approaches, particularly in low-data scenarios.
Naderi et al. (Mon,) studied this question.
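
The two ingredients named in the abstract, contrastive alignment of paired audio-text embeddings and sample-specific gating over modality-specialized experts, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the loss or the gating network. The sketch assumes a symmetric InfoNCE-style objective, one `tanh` linear expert per modality, and a softmax gate over the concatenated embeddings. It also assumes both modalities are already projected to a shared dimension. All function names, the `params` keys, and the `temperature` value are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(z, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_info_nce(text_emb, audio_emb, temperature=0.07):
    """Assumed contrastive objective: row i of each input is a matched pair.

    Matched pairs sit on the diagonal of the similarity matrix; every other
    entry in the same row or column serves as an in-batch negative.
    """
    t = l2_normalize(text_emb)
    a = l2_normalize(audio_emb)
    logits = t @ a.T / temperature            # (batch, batch) cosine similarities
    idx = np.arange(len(t))
    loss_t2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> audio
    loss_a2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> text
    return 0.5 * (loss_t2a + loss_a2t)


def gated_expert_fusion(text_emb, audio_emb, params):
    """Assumed MoE-style fusion: one expert per modality, per-sample gate."""
    h_text = np.tanh(text_emb @ params["W_text"])      # text expert output
    h_audio = np.tanh(audio_emb @ params["W_audio"])   # audio expert output
    gate_in = np.concatenate([text_emb, audio_emb], axis=-1)
    gates = np.exp(log_softmax(gate_in @ params["W_gate"], axis=-1))  # (batch, 2)
    fused = gates[:, :1] * h_text + gates[:, 1:] * h_audio
    return fused, gates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))    # stand-ins for pre-trained text embeddings
    audio = rng.normal(size=(4, 8))   # stand-ins for pre-trained audio embeddings
    params = {
        "W_text": rng.normal(size=(8, 5)),
        "W_audio": rng.normal(size=(8, 5)),
        "W_gate": rng.normal(size=(16, 2)),
    }
    print("contrastive loss:", symmetric_info_nce(text, audio))
    fused, gates = gated_expert_fusion(text, audio, params)
    print("fused shape:", fused.shape, "gate weights per sample:", gates)
```

In this sketch each row of `gates` sums to one, so the fused vector is a per-sample convex combination of the two expert outputs. This is one simple way to let the model lean on whichever modality is more informative for a given sample.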