This paper proposes a semi-supervised audio-text contrastive learning method based on pseudo-text inputs. The method converts unlabeled audio into effective training data without requiring additional annotations. Its key idea is to mix a labeled audio clip with an unlabeled one. Because the unlabeled clip lacks a textual counterpart, we generate a pseudo-text input for it with an audio-to-text mapper (a2t). We investigate two mapper variants: a simple multi-layer perceptron (MLP) that outputs a single vector, and a Transformer decoder that produces a short sequence of query tokens. Both are trained jointly with the encoders. Training is driven by three InfoNCE losses: one on labeled pairs, one on mixed (labeled + unlabeled) pairs, and one on unlabeled audio-pseudo-text pairs. Gaussian noise is added to the audio embedding to regularize the pseudo-text mapping. On cross-modal retrieval, our method yields a 6.6% relative improvement in Recall@1 on AudioCaps and a 4.4% relative improvement on Clotho over a fine-tuned CLAP baseline. Without requiring any captions for the additional audio, our method surpasses baselines trained on larger, fully labeled datasets.
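To make the three-loss objective concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes things the abstract does not specify: the a2t mapper is reduced to a single linear map (`mapper_weight`), mixing is a convex combination applied at the embedding level with a hypothetical coefficient `lam`, the pseudo-text is treated as an embedding rather than as an input to the text encoder, the three losses are summed with equal weight, and the temperature and noise scale are placeholder values.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(audio_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: row i of audio_emb is the positive for row i of text_emb.
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature
    loss_audio_to_text = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_text_to_audio = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


def pseudo_text(audio_emb, mapper_weight, noise_std, rng):
    # Assumption: the a2t mapper is a plain linear map. Gaussian noise is added
    # to the audio embedding before mapping, as a regularizer.
    noisy = audio_emb + noise_std * rng.standard_normal(audio_emb.shape)
    return noisy @ mapper_weight


def total_loss(lab_audio, lab_text, unl_audio, mapper_weight,
               lam=0.5, noise_std=0.1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    unl_text = pseudo_text(unl_audio, mapper_weight, noise_std, rng)
    # Assumption: mixing is a convex combination of embeddings on both sides.
    mix_audio = lam * lab_audio + (1.0 - lam) * unl_audio
    mix_text = lam * lab_text + (1.0 - lam) * unl_text
    loss_labeled = info_nce(lab_audio, lab_text)
    loss_mixed = info_nce(mix_audio, mix_text)
    loss_unlabeled = info_nce(unl_audio, unl_text)
    # Assumption: equal weighting of the three terms.
    return loss_labeled + loss_mixed + loss_unlabeled


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    batch, dim = 8, 16
    lab_audio = rng.standard_normal((batch, dim))
    lab_text = lab_audio + 0.1 * rng.standard_normal((batch, dim))
    unl_audio = rng.standard_normal((batch, dim))
    mapper_weight = np.eye(dim)  # stand-in for a trained mapper
    print(total_loss(lab_audio, lab_text, unl_audio, mapper_weight, rng=rng))
```

In a real training loop the embeddings would come from the audio and text encoders, and the mapper and encoders would be updated jointly by gradient descent on this objective; the sketch only evaluates the loss value.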
Komatsu et al. (Thu,) studied this question.