Speech Guided Masked Image Modeling for Visually Grounded Speech

Abstract

The objective of this study is to investigate the learning process of Visually Grounded Speech (VGS) models through joint learning that combines contrastive learning and masked image modeling. Typically, VGS models aim to establish audio-visual alignment between images and spoken captions within a contrastive learning framework. Building upon this concept, we explore whether cross-modal visual reconstruction can enhance alignment, given that spoken captions describe visual appearances. To achieve this, we extend contrastive learning-based VGS models by incorporating a masked autoencoder that uses cross-attention in the decoder. Through this cross-modal interaction in the decoder, spoken caption features guide the model to reconstruct the masked patches and capture the correspondence between the two modalities. Our findings suggest that integrating cross-modal reconstruction within the contrastive learning framework enhances audio-visual feature alignment. Consequently, our proposed method achieves performance comparable to existing models that utilize prior knowledge or other modalities, such as object region proposals or Contrastive Language-Image Pretraining (CLIP).
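
The joint objective described above can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the reconstruction weight, the use of mean squared error on masked patches, and the single-head attention without learned projections are all hypothetical choices made for brevity.

```python
import numpy as np


def info_nce(image_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, spoken caption) embeddings.

    The temperature value is an assumption, not taken from the paper.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    spc = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = img @ spc.T / temperature  # (B, B); matched pairs lie on the diagonal

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilise the softmax
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-speech and speech-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: (num_patches, d) decoder tokens for image patches.
    keys_values: (num_frames, d) spoken caption features that guide reconstruction.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ keys_values


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask is 1 where masked)."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return (per_patch * mask).sum() / mask.sum()


def joint_loss(contrastive, reconstruction, recon_weight=1.0):
    """Weighted sum of the two terms; recon_weight is a hypothetical hyperparameter."""
    return contrastive + recon_weight * reconstruction
```

In this sketch the speech features enter the image branch only through `cross_attention`, so the reconstruction term can be reduced only by using information from the spoken caption, which is the intuition behind speech-guided masked image modeling.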
