June 30, 2024 · Open Access

Cross-modal Object Decoding and Referring Expression Decoupling for Referring Video Object Segmentation

Hyun-Jin Chun; Kyung-Min Park (Konkuk University); Incheol Kim (University of Nebraska–Lincoln)

Abstract

Referring Video Object Segmentation (RVOS) is a complex computer vision task that requires detecting, segmenting, and tracking the specific object in a video referred to by a given natural language expression. In this paper, we propose CDTD-RVOS (Cross-modal Decoding and Text Decoupling for RVOS), a novel Transformer-based deep neural network model for RVOS. The proposed model uses Transformers to extract object-specific visual features at three levels: pixels, frames, and the entire video. To capture the meaning of the referring expression correctly, the model applies a text decoupling technique that divides the words of the expression into their functional components and encodes them into rich linguistic features. Moreover, the model performs cross-modal fusion between the visual features of video objects and the linguistic features of the referring expression at the same three levels, which enhances the alignment between the two heterogeneous feature types. Extensive experiments on three benchmark datasets, A2D-Sentences, Ref-Youtube-VOS, and Ref-DAVIS 17, show that the proposed CDTD-RVOS model achieves high performance.
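The abstract does not specify the fusion operator, so the following is only a minimal sketch of one common choice in Transformer models: scaled dot-product cross-attention, with visual features as queries and word features as keys and values. The function names (`softmax`, `cross_attention`), the residual addition, and the toy feature values are all illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual, text):
    """Fuse visual features with word features (hypothetical helper).

    visual: list of query vectors (could be pixel-, frame-, or video-level features)
    text:   list of word feature vectors, used as both keys and values
    Returns (fused, weights), where fused[i] is visual[i] plus the
    attention-weighted sum of the word features (a residual connection).
    """
    dim = len(text[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for q in visual:
        # Similarity of this visual feature to every word feature.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in text]
        w = softmax(scores)
        # Attention-weighted combination of the word features.
        context = [sum(w[j] * text[j][d] for j in range(len(text)))
                   for d in range(dim)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights

# Toy, made-up features: two visual tokens and three word tokens, dimension 2.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused_feats, attn = cross_attention(visual_feats, word_feats)
```

In this sketch the same function could be applied separately to pixel-level, frame-level, and video-level features, which mirrors the multi-level fusion the abstract describes; how the actual model shares or separates parameters across levels is not stated in the abstract.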

Cite This Study

Chun et al. (2024) studied this question.

synapsesocial.com/papers/68e6278eb6db6435875b9ccc https://doi.org/10.9717/kmms.2024.27.6.643
