Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on mel-spectrograms. We evaluate on the X-ARES suite, which covers speech, music, and environmental sound tasks. Although our implementation is a straightforward translation of the original model to audio, it performs comparably to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and no hyper-parameter tuning. All code and pretrained checkpoints will be released on GitHub.
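As a rough illustration of the random patch masking step described above, the following minimal sketch splits a mel-spectrogram's patch grid into visible (context) and masked (target) patch indices. Only the clip length and sample rate come from the abstract; the hop length, number of mel bins, patch size, mask ratio, and the helper names `patch_grid` and `random_patch_mask` are assumptions made for illustration, not the authors' settings or code.

```python
import random


def patch_grid(n_mels, n_frames, patch_h=16, patch_w=16):
    """Return (rows, cols) of the non-overlapping patch grid.

    Frames or mel bins that do not fill a whole patch are dropped.
    """
    return n_mels // patch_h, n_frames // patch_w


def random_patch_mask(n_patches, mask_ratio, seed=None):
    """Randomly split patch indices into (context, target) lists.

    `target` holds the masked patches whose latents would be predicted;
    `context` holds the visible patches fed to the encoder.
    """
    rng = random.Random(seed)
    n_masked = int(round(n_patches * mask_ratio))
    order = list(range(n_patches))
    rng.shuffle(order)
    target = sorted(order[:n_masked])
    context = sorted(order[n_masked:])
    return context, target


# 10 s at 32 kHz gives 320,000 samples (values from the abstract).
n_samples = 10 * 32_000

# Everything below is a hypothetical configuration.
hop = 320                      # assumed hop length -> 1000 frames
n_frames = n_samples // hop
rows, cols = patch_grid(n_mels=128, n_frames=n_frames)
context_idx, target_idx = random_patch_mask(rows * cols, mask_ratio=0.5, seed=0)
```

In a JEPA-style setup, a predictor would then estimate the latent representations of the `target_idx` patches from the encoded `context_idx` patches, with the loss computed in latent space rather than on the spectrogram itself.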
Ludovic Tuncay
Centre National de la Recherche Scientifique
Étienne Labbé
Centre National de la Recherche Scientifique
Emmanouil Benetos
synapsesocial.com/papers/68f5fcdc8d54a28a75cf235e — DOI: https://doi.org/10.48550/arxiv.2507.02915