In this paper, we introduce Captioning decoder Contrastive Language-Audio Pretraining with data Augmentation (C-CLAPA), a new audio-text model for the Cross Domain Retrieval (CDR) task. The model's backbone comprises two encoders, one for text and one for audio. The embedding vectors from the two modalities are commonly trained with a contrastive loss. In our approach, a captioning decoder is also used to generate a text description from the embedding vector of the audio sample. This decoder ensures that the audio embedding encapsulates text information, and it is used only during training. Data preparation, including filtering, augmentation, and text generation with Large Language Models (LLMs), is used to extend the existing training dataset. Finally, the proposed model is trained with a curriculum procedure, in which the model is trained on datasets of increasing quality. Our empirical evaluation shows that the model significantly surpasses current state-of-the-art (SOTA) models on the available benchmarks. An ablation analysis provides empirical evidence for the advantages of the proposed architectural design and for the efficacy of the data processing methodology.
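To make the training objective described above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a symmetric cross-entropy contrastive loss over cosine similarities, as is common for language-audio pretraining, and adds a captioning term as a weighted sum. The names `contrastive_loss`, `training_loss`, `caption_nll`, and `caption_weight`, and the temperature value `0.07`, are illustrative assumptions that do not appear in the abstract.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    audio_emb[i] and text_emb[i] are assumed to describe the same sample.
    The temperature is a hypothetical default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # logits[i][j]: scaled cosine similarity between audio i and text j
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def training_loss(audio_emb, text_emb, caption_nll, caption_weight=1.0):
    """Contrastive loss plus a weighted captioning term.

    caption_nll stands in for the captioning decoder's negative
    log-likelihood of the reference text given the audio embedding.
    The weighted-sum form and caption_weight are assumptions.
    """
    return contrastive_loss(audio_emb, text_emb) + caption_weight * caption_nll
```

With correctly paired embeddings the contrastive term approaches zero, and it grows when the pairs are shuffled. Because the decoder is used only during training, a setup like this would drop the captioning term at retrieval time and keep only the two encoders.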
Sofer et al. studied this question.