What type of study is this?

This is a quantitative study.

October 7, 2025 | Open Access

Audio Visual Segmentation Through Text Embeddings

Key Points

AV2T-SAM improves audio-visual segmentation by mapping audio features into the text embedding space of a pre-trained, text-prompted SAM.
The method achieves better audio-visual alignment and outperforms existing approaches on the AVSBench dataset.
By reusing multimodal correspondence learned from large text-image paired datasets, the framework mitigates the scarcity of annotated audio-visual segmentation data.
A new feature that emphasizes semantics shared by the audio and visual modalities helps filter out irrelevant noise.

Abstract

The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt to overcome the challenge of limited data by leveraging the vision foundation model, Segment Anything Model (SAM), prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden on understanding visual modality by utilizing knowledge of pre-trained SAM, it does not address the fundamental challenge of learning audio-visual correspondence with limited data. To address this limitation, we propose AV2T-SAM, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $\mathbf{f}_{CLIP} \odot \mathbf{f}_{CLAP}$, which emphasizes shared semantics of audio and visual modalities while filtering irrelevant noise. Our approach outperforms existing methods on the AVSBench dataset by effectively utilizing pre-trained segmentation models and cross-modal semantic alignment. The source code is released at https://github.com/bok-bok/AV2T-SAM.
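
To make the idea of a feature that "emphasizes shared semantics of audio and visual modalities while filtering irrelevant noise" more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the feature is an element-wise product of an image embedding and an audio embedding that already live in a shared, text-aligned space. The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def shared_semantics_feature(image_emb, audio_emb):
    """Element-wise (Hadamard) product of two unit-normalized embeddings.

    Assumption: both embeddings share the same space and dimensionality.
    A dimension contributes strongly only when it is active in BOTH inputs;
    a dimension that is near zero in either input is suppressed.
    """
    if len(image_emb) != len(audio_emb):
        raise ValueError("embeddings must have the same dimensionality")
    img = l2_normalize(image_emb)
    aud = l2_normalize(audio_emb)
    return [i * a for i, a in zip(img, aud)]


# Toy 4-dimensional embeddings (made-up numbers, for illustration only).
image_emb = [0.9, 0.1, 0.0, 0.4]
audio_emb = [0.8, 0.0, 0.5, 0.3]

fused = shared_semantics_feature(image_emb, audio_emb)

# Summing the fused feature gives the cosine similarity of the two inputs.
cosine = sum(fused)
```

In this toy example, the second and third dimensions are active in only one modality, so they vanish in the fused feature, while the first dimension, which is strong in both, dominates. That is one plausible reading of how such a product could filter out modality-specific noise.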
