October 20, 2025 · Open Access

Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning

Key Points

Audio-JEPA predicts latent representations of masked spectrogram patches instead of reconstructing raw audio.
Pre-trained on unlabeled AudioSet clips, it performs comparably to wav2vec 2.0 and data2vec while using less than one-fifth of their training data.
The framework uses a simple Vision Transformer backbone and is a straightforward translation of the original JEPA model to audio.
Results on the X-ARES suite (speech, music, and environmental sound tasks) were obtained with no hyper-parameter tuning.

Abstract

Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on mel-spectrograms. We evaluate on the X-ARES suite covering speech, music, and environmental sound tasks. Although our implementation is a straightforward translation of the original model to audio, the results still show comparable performance to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no hyper-parameter tuning. All code and pretrained checkpoints will be released on GitHub.
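The masking and latent-prediction idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch size (16), mask ratio (0.5), spectrogram shape (128 x 256), the random linear projection standing in for the encoder, the mean-pooling "predictor", and the mean-squared-error loss are all assumptions chosen only to keep the example small and runnable. The function names `patchify`, `random_patch_mask`, and `latent_prediction_loss` are hypothetical.

```python
import numpy as np

def patchify(spec, patch=16):
    """Split a (n_mels, n_frames) spectrogram into flattened, non-overlapping patches."""
    n_mels, n_frames = spec.shape
    assert n_mels % patch == 0 and n_frames % patch == 0
    gh, gw = n_mels // patch, n_frames // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh * gw, patch * patch)
    x = spec.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return (context_idx, target_idx): indices of visible and masked patches."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

def latent_prediction_loss(predicted, target):
    """Mean squared error computed on latent vectors, not on raw audio or pixels."""
    return float(np.mean((predicted - target) ** 2))

rng = np.random.default_rng(0)

# Stand-in for a log-mel spectrogram; the shape is an assumption for illustration.
spec = rng.standard_normal((128, 256)).astype(np.float32)
patches = patchify(spec, patch=16)  # 8 x 16 grid of patches, 256 values each
context_idx, target_idx = random_patch_mask(len(patches), 0.5, rng)

# Toy encoder: a fixed random linear projection to 64-dim latents (assumption;
# the paper uses a Vision Transformer backbone instead).
W = rng.standard_normal((256, 64)).astype(np.float32) / 16.0
context_latents = patches[context_idx] @ W
target_latents = patches[target_idx] @ W

# Placeholder predictor: guess the mean context latent for every masked patch.
predicted = np.repeat(context_latents.mean(axis=0, keepdims=True),
                      len(target_idx), axis=0)
loss = latent_prediction_loss(predicted, target_latents)
```

The point of the sketch is the shape of the objective: only the visible (context) patches feed the predictor, and the loss compares predicted and target vectors in latent space at the masked positions. In a real model the projection and the placeholder predictor would be replaced by trained networks.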

Authors

Ludovic Tuncay

Centre National de la Recherche Scientifique

Étienne Labbé

Centre National de la Recherche Scientifique

Emmanouil Benetos
