What question did this study set out to answer?

This research aims to develop a semi-supervised learning method that effectively uses unlabeled audio data for training.

May 10, 2026 | Open Access

Semi-supervised text-audio contrastive learning method using pseudo-text input

Key points

Proposed mixing a labeled audio clip with an unlabeled one and generating a pseudo-text input for the unlabeled clip with an audio-to-text mapper (a2t).
Investigated two mapper variants, a multi-layer perceptron (MLP) and a Transformer decoder, both trained jointly with the encoders.
Used three InfoNCE losses to drive training and added Gaussian noise to the audio embedding to regularize the pseudo-text mapping.
Achieved a 6.6% relative improvement in Recall@1 on AudioCaps and a 4.4% gain on Clotho over a fine-tuned CLAP baseline.
Surpassed baselines trained on larger fully labeled datasets, without requiring any captions for the additional audio.
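
To make the reported metric concrete, the sketch below shows one common way to compute Recall@1 from a query-by-candidate similarity matrix, and how a relative improvement is derived from two scores. It assumes a one-to-one pairing (query i matches candidate i), which simplifies the real benchmarks, and every number in it is made up for illustration; none are taken from the paper. The function names are hypothetical.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the true match.

    Assumption: query i is paired with candidate i (one-to-one pairing).
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Toy similarity scores (illustrative only): rows are queries, columns are candidates.
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0 ranks candidate 0 first: hit
    [0.3, 0.2, 0.8],  # query 1 ranks candidate 2 first: miss
    [0.1, 0.2, 0.7],  # query 2 ranks candidate 2 first: hit
]
toy_recall = recall_at_1(toy_similarity)

# Hypothetical scores: moving from 0.50 to 0.55 is a relative gain of about 10%.
toy_gain = relative_improvement(0.50, 0.55)
print(f"Recall@1 = {toy_recall:.3f}, relative gain = {toy_gain:.1%}")
```

A relative improvement is measured against the baseline score rather than in absolute points, so the same percentage can correspond to different absolute gains on different datasets.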

Abstract

This paper proposes a semi-supervised audio-text contrastive learning method based on pseudo-text inputs. The method converts unlabeled audio into effective training data without requiring additional annotations. Its key idea is to mix a labeled audio clip with an unlabeled one. Because the unlabeled clip lacks a textual counterpart, a pseudo-text input is generated for it by an audio-to-text mapper (a2t). Two mapper variants are investigated: a simple multi-layer perceptron (MLP) that outputs a single vector, and a Transformer decoder that produces a short sequence of query tokens. Both are trained jointly with the encoders. Training is driven by three InfoNCE losses: one on labeled pairs, one on mixed (labeled + unlabeled) pairs, and one on unlabeled audio and pseudo-text pairs. Gaussian noise is added to the audio embedding to regularize the pseudo-text mapping. On cross-modal retrieval, the method yields a 6.6% relative improvement in Recall@1 on AudioCaps and a 4.4% gain on Clotho over a fine-tuned CLAP baseline. Without requiring any captions for the additional audio, it surpasses baselines trained on larger fully labeled datasets.
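
The three-loss objective described above can be sketched roughly as follows. This is a minimal NumPy illustration under several assumptions that the abstract does not confirm: random vectors stand in for encoder outputs, a single linear map stands in for the MLP mapper, mixing is done by averaging embeddings, the three losses are summed with equal weight, and the temperature (0.07) and noise scale (0.1) are arbitrary. All function and variable names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of audio_emb is the positive for row i of text_emb."""
    logits = l2_normalize(audio_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_text + text_to_audio)


def pseudo_text(audio_emb, weight, noise_std, rng):
    """Map audio embeddings to pseudo-text embeddings.

    Gaussian noise is added to the audio embedding before mapping.
    Assumption: one linear layer stands in for the a2t mapper.
    """
    noisy = audio_emb + noise_std * rng.standard_normal(audio_emb.shape)
    return noisy @ weight


rng = np.random.default_rng(0)
batch, dim = 8, 16

# Stand-ins for encoder outputs (assumption: real encoders would produce these).
labeled_audio = rng.standard_normal((batch, dim))
caption_text = rng.standard_normal((batch, dim))
unlabeled_audio = rng.standard_normal((batch, dim))
mapper_weight = rng.standard_normal((dim, dim)) / np.sqrt(dim)

pseudo = pseudo_text(unlabeled_audio, mapper_weight, noise_std=0.1, rng=rng)

# Assumption: mix by averaging embeddings on both the audio and the text side.
mixed_audio = 0.5 * (labeled_audio + unlabeled_audio)
mixed_text = 0.5 * (caption_text + pseudo)

loss_labeled = info_nce(labeled_audio, caption_text)   # labeled pairs
loss_mixed = info_nce(mixed_audio, mixed_text)         # mixed pairs
loss_unlabeled = info_nce(unlabeled_audio, pseudo)     # unlabeled audio vs. pseudo-text
total_loss = loss_labeled + loss_mixed + loss_unlabeled
print(f"total loss: {total_loss:.4f}")
```

In a real training setup the mapper and encoders would be updated by backpropagating through this total; the sketch only evaluates the losses once to show how the three pair types fit together.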
