Referring Expression Comprehension (REC) aims to achieve fine-grained cross-modal content alignment. Traditional two-stage approaches decompose REC into localization (region proposal) and comprehension (expression-based ranking), which isolates continuous image information and makes the result depend heavily on the quality of the proposals. In this paper, we propose a point-based two-stage framework for REC that achieves localization quickly by inserting a language-modulated auto-focus module into the locked vision model. Specifically, we redefine REC as two processes: point-based cross-modal comprehension and point-based instance localization. For the comprehension stage, we reconstruct the raw annotations into soft masks at the feature point level as a metric of cross-modal correlation. With this indirect metric, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions. Remarkably, soft masks are shape-independent, which makes our method highly general: by switching between vision models, different types of predictions (e.g., localization and segmentation) can be obtained. Experiments on multiple benchmarks demonstrate the feasibility and potential of our point-based paradigm. Our code will be publicly available at https://github.com/VILAN-Lab/PBREC-AF.
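The idea of turning a raw box annotation into a soft mask at the feature point level can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `soft_mask_from_box` and `binary_targets`, the linear decay from the box center, the `stride` parameter, and the 0.5 threshold are all assumptions made for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def soft_mask_from_box(box: Box, feat_h: int, feat_w: int, stride: int) -> List[List[float]]:
    """Hypothetical helper: one soft label in [0, 1] per feature point.

    Assumption: the score is 1 at the box center, decays linearly to 0
    at the box border, and is 0 for points outside the box.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = max((x2 - x1) / 2.0, 1e-6)  # guard against degenerate boxes
    half_h = max((y2 - y1) / 2.0, 1e-6)

    mask = []
    for i in range(feat_h):
        py = (i + 0.5) * stride  # pixel-space center of feature row i
        row = []
        for j in range(feat_w):
            px = (j + 0.5) * stride  # pixel-space center of feature column j
            dx = abs(px - cx) / half_w
            dy = abs(py - cy) / half_h
            row.append(max(0.0, 1.0 - max(dx, dy)))
        mask.append(row)
    return mask


def binary_targets(mask: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Threshold the soft mask into per-point binary labels (assumed threshold)."""
    return [[1 if v >= threshold else 0 for v in row] for row in mask]


if __name__ == "__main__":
    # A 64x64 image region covered by a 4x4 feature map (stride 16).
    mask = soft_mask_from_box((0, 0, 64, 64), feat_h=4, feat_w=4, stride=16)
    for row in binary_targets(mask):
        print(row)
```

Under these assumptions each feature point carries its own label, so a per-point classifier can be trained against the mask without ranking region proposals. The paper's actual soft-mask construction may differ.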
Shiyi Zheng
Peizhi Zhao
Qingbao Huang
ACM Transactions on Multimedia Computing Communications and Applications
The University of Adelaide
Guangxi University
Communication University of China
Zheng et al. (Thu,) studied this question.
www.synapsesocial.com/papers/692b9d831d383f2b2a3797ac — DOI: https://doi.org/10.1145/3777449