May 24, 2024

Cross-modal Spectral Fusion Model for Referring Video Object Segmentation

Kesi Huang, Tianxiao Li, Qiqiang Xia

Key points

CSF improves accuracy on referring video object segmentation, outperforming prior methods.
Evaluation on three datasets supports the model's robustness.
The multi-scale spectral fusion and consensus fusion modules improve segmentation performance across varied conditions and inputs.
The results point to applications in more sophisticated video understanding tasks and to the need for models that go beyond current methods.

Abstract

Referring Video Object Segmentation (R-VOS) demands precise visual comprehension and sophisticated cross-modal reasoning to segment objects in videos based on descriptions from natural language. Addressing this challenge, we introduce the Cross-modal Spectral Fusion Model (CSF). Our model incorporates a Multi-Scale Spectral Fusion Module (MSFM), which facilitates robust global interactions between the modalities, and a Consensus Fusion Module (CFM) that dynamically balances multiple prediction vectors based on text features and spectral cues for accurate mask generation. Additionally, the Dual-stream Mask Decoder (DMD) enhances the segmentation accuracy by capturing both local and global information through parallel processing. Tested on three datasets, CSF surpasses existing methods in R-VOS, proving its efficacy and potential for advanced video understanding tasks.
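
The abstract does not give the internals of MSFM or CFM, so the following is only a rough illustration of the two general ideas it names: fusing text into visual features in the frequency domain, and weighting several prediction vectors by text-derived scores. Every name, shape, and parameter here (`spectral_fuse`, `consensus_fuse`, `freq_filter`, `w_gate`, `w_score`) is hypothetical and is not taken from the paper; a single-scale FFT filter and a softmax weighting stand in for whatever the authors actually use.

```python
import numpy as np

def spectral_fuse(visual, text, freq_filter, w_gate):
    """Illustrative frequency-domain fusion (not the paper's MSFM).

    visual:      (C, H, W) real feature map
    text:        (D,) sentence embedding
    freq_filter: (C, H, W // 2 + 1) hypothetical learnable spectral filter
    w_gate:      (C, D) hypothetical projection from text to channel gates
    """
    # Per-channel gate in (0, 1) derived from the text embedding.
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ text)))
    # Each frequency coefficient depends on every spatial position,
    # so filtering here mixes information globally.
    spec = np.fft.rfft2(visual, axes=(-2, -1))
    spec = spec * freq_filter * gate[:, None, None]
    return np.fft.irfft2(spec, s=visual.shape[-2:], axes=(-2, -1))

def consensus_weights(text, w_score):
    """Softmax weights over K candidates; w_score is (K, D), hypothetical."""
    scores = w_score @ text
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

def consensus_fuse(preds, text, w_score):
    """Weighted sum of K prediction vectors; preds is (K, C)."""
    return consensus_weights(text, w_score) @ preds

# Small usage example with random inputs.
rng = np.random.default_rng(0)
C, H, W, D, K = 4, 8, 8, 16, 3
visual = rng.standard_normal((C, H, W))
text = rng.standard_normal(D)
fused = spectral_fuse(visual, text,
                      freq_filter=np.ones((C, H, W // 2 + 1)),
                      w_gate=rng.standard_normal((C, D)))
query = consensus_fuse(rng.standard_normal((K, C)), text,
                       w_score=rng.standard_normal((K, D)))
print(fused.shape, query.shape)
```

The appeal of working in the frequency domain is that one element-wise multiplication acts on the whole spatial extent at once, which is one plausible reading of the "global interactions" the abstract attributes to MSFM.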
