This paper addresses the task of Multi-modal Referring Video Segmentation, in which a well-optimized model recognizes and segments the target objects referred to by given guidance signals, e.g., a language description. Early approaches model this task as a sequence prediction problem; lacking a global view of the video content, they struggle to exploit inter-frame relationships effectively. Some recent works perform temporal modeling with a vanilla attention mechanism. However, the condensed visual representation of a frame often carries noisy target information because of occlusion or motion blur, and unconstrained non-local operations spread this noise across the entire sequence and interfere with the extraction of global representations. To address this issue, we present the Semantic-assisted Object Cluster network (SOC) and its improved version, SOC++. Our method unifies temporally selective interaction and cross-modal alignment to achieve video-level understanding. In SOC++, a proxy-assisted multi-modal fusion module is introduced to perform preliminary bidirectional activation. A semantic integration module with a progressive frame-to-video structure then facilitates joint space learning across modalities and time steps. Because potentially noisy visual embeddings would impair the overall representation of the target objects under unconstrained inter-frame interaction, we propose to bias the video aggregation in this module toward informative frames, i.e., those with lower entropy, by emphasizing their indicative role. A multi-modal query contrastive supervision is also used to help construct a well-aligned joint space at the video level. Moreover, to combine the advantages of high-level video information and the low-level details of each frame, we introduce a dynamic query fusion module that jointly updates these embeddings.
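The idea of emphasizing lower-entropy frames during video aggregation can be illustrated with a minimal sketch. The abstract does not specify how the entropy is computed or how it is turned into weights, so everything below is an assumption made for illustration: the per-frame scores, the softmax over negative entropy, the temperature parameter, and the function names are hypothetical and are not taken from the SOC++ implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_aggregate(frame_embeddings, frame_scores, temperature=1.0):
    """Aggregate per-frame embeddings into one video-level embedding.

    frame_embeddings: list of T vectors, each a list of D floats.
    frame_scores: list of T lists of raw scores per frame (hypothetical,
        e.g. attention logits over spatial locations).
    Assumption: frames whose score distribution has lower entropy are
    treated as more informative and receive a larger weight.
    """
    # Entropy of each frame's normalized score distribution.
    entropies = [entropy(softmax(scores)) for scores in frame_scores]
    # Lower entropy -> larger weight, via a softmax over negative entropy.
    weights = softmax([-h / temperature for h in entropies])
    dim = len(frame_embeddings[0])
    # Weighted sum of the frame embeddings, one dimension at a time.
    video_embedding = [
        sum(w * emb[d] for w, emb in zip(weights, frame_embeddings))
        for d in range(dim)
    ]
    return video_embedding, weights
```

In this sketch, a frame with a sharply peaked score distribution contributes more to the video-level embedding than a frame whose scores are spread uniformly, which mirrors the stated goal of limiting the influence of frames degraded by occlusion or motion blur.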
We conduct extensive experiments on popular referring video segmentation benchmarks, and our method outperforms state-of-the-art competitors on all of them by a clear margin. In addition, the emphasis on temporal coherence improves segmentation stability and makes our method more adaptable to text expressions that describe temporal variations. The code is available at https://github.com/yongliu20/MRVSSOC.
Liu et al. (Wed,) studied this question.