October 26, 2024 | Open Access

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Abstract

Audio-language joint learning has advanced rapidly, with models such as CLAP showing much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which a contrastive loss is applied to reach coarse-grained cross-modal alignment. However, frame-level correspondence with text may be ignored, which limits explainability and performance on fine-grained tasks and may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics, bridging the gap between frame and word features. Building on this codebook, a locality-aware block is introduced to purify local patterns, and a hard-negative guided loss is devised to boost alignment. Experiments on eleven zero-shot coarse- and fine-grained tasks suggest that our model not only surpasses the baseline CLAP significantly but also yields superior or competitive results compared to current SOTA works.
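The coarse-grained alignment step described in the abstract (pooling frame or word features into global ones, then applying a contrastive loss) can be illustrated with a minimal sketch. This is not the paper's implementation: mean pooling, the temperature value of 0.07, the symmetric cross-entropy form, and all function names are assumptions chosen for illustration, and the shared codebook, locality-aware block, and hard-negative guided loss are not modeled here.

```python
import math

def mean_pool(local_feats):
    """Average local (frame or word) vectors into one global vector.
    Mean pooling is an assumption; the abstract only says 'aggregate'."""
    n, dim = len(local_feats), len(local_feats[0])
    return [sum(v[d] for v in local_feats) / n for d in range(dim)]

def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_frames, text_words, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired clips and captions.

    audio_frames[i] is a list of frame vectors for clip i;
    text_words[i] is a list of word vectors for its caption.
    Pair (i, i) is the positive; every other pair in the batch is a negative.
    """
    audio_global = [l2_normalize(mean_pool(f)) for f in audio_frames]
    text_global = [l2_normalize(mean_pool(w)) for w in text_words]
    n = len(audio_global)

    # Temperature-scaled cosine similarity between every clip and caption.
    logits = [
        [sum(a * t for a, t in zip(audio_global[i], text_global[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: same objective on the transposed similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2a = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

# Toy batch: two clips with two frames each, two captions with one word each.
audio = [[[1.0, 0.0], [1.0, 0.1]], [[0.0, 1.0], [0.1, 1.0]]]
matched = [[[1.0, 0.0]], [[0.0, 1.0]]]
swapped = [[[0.0, 1.0]], [[1.0, 0.0]]]
loss_matched = symmetric_contrastive_loss(audio, matched)
loss_swapped = symmetric_contrastive_loss(audio, swapped)
```

In this toy batch the correctly paired captions give a lower loss than the swapped ones. Because the loss only sees the pooled global vectors, it carries no information about which frame matches which word, which is the limitation the abstract points to.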
