Referring Video Object Segmentation (RVOS) is a complex computer vision task that requires detecting, segmenting, and tracking the specific object in a video that a given natural language expression refers to. In this paper, we propose CDTD-RVOS (Cross-modal Decoding and Text Decoupling for RVOS), a novel Transformer-based deep neural network model for RVOS. The proposed model uses Transformers to extract object-specific visual features at three levels: pixel, frame, and whole video. To correctly capture the meaning of the referring expression, the model uses a text decoupling technique that divides the words of the expression into their functional components and encodes them into rich linguistic features. Moreover, the proposed model performs cross-modal fusion between the visual features of video objects and the linguistic features of the referring expression at each of the same three levels (pixel, frame, and video) to improve the alignment between the two heterogeneous feature types. Extensive experiments on three benchmark datasets, A2D-Sentences, Ref-Youtube-VOS, and Ref-DAVIS 17, show that the proposed CDTD-RVOS model achieves high performance.
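The multi-level cross-modal fusion described above can be illustrated with a minimal sketch. This is not the CDTD-RVOS implementation: it assumes plain scaled dot-product cross-attention with a residual connection and no learned projections, and every name in it (`cross_attend`, `fuse_multilevel`, the toy feature vectors) is hypothetical. It only shows the general idea of letting visual features at the pixel, frame, and video levels each attend over the word features of the referring expression.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual, text):
    """Each visual vector (query) attends over all word vectors (keys and values).

    Assumption: no learned query/key/value projections, unlike a real Transformer.
    Returns (fused features, attention weights).
    """
    dim = len(text[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for v in visual:
        # Similarity between this visual vector and every word vector.
        scores = [sum(a * b for a, b in zip(v, w)) / scale for w in text]
        attn = softmax(scores)
        # Attention-weighted sum of the word vectors.
        context = [
            sum(attn[j] * text[j][k] for j in range(len(text)))
            for k in range(dim)
        ]
        # Residual connection: keep the visual signal and add linguistic context.
        fused.append([a + c for a, c in zip(v, context)])
        weights.append(attn)
    return fused, weights


def fuse_multilevel(pixel_feats, frame_feats, video_feat, word_feats):
    """Apply the same fusion step at the pixel, frame, and video levels."""
    return {
        "pixel": cross_attend(pixel_feats, word_feats)[0],
        "frame": cross_attend(frame_feats, word_feats)[0],
        "video": cross_attend([video_feat], word_feats)[0][0],
    }


# Toy 2-dimensional features (hypothetical values, for illustration only).
words = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
frames = [[1.0, 0.0]]
video = [0.5, 0.5]

fused = fuse_multilevel(pixels, frames, video, words)
_, pixel_weights = cross_attend(pixels, words)
```

In this toy setup, a pixel vector aligned with the first word places most of its attention on that word, while a vector equally similar to both words splits its attention evenly. A real model would add learned projections, multiple heads, and normalization at each level.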
Chun et al. (Sun,) studied this question.