November 30, 2025

Implement Referring Expression Comprehension by Extending Auto-focus Lens to Locked Vision Model

Key Points

A point-based framework avoids the isolated region proposals of traditional two-stage approaches, so REC no longer depends on proposal quality.
Raw annotations are reconstructed into soft masks at the feature-point level, which lets REC be approximated as a binary classification problem.
A language-modulated auto-focus module, inserted into a locked vision model, handles localization.
Because soft masks are shape-independent, switching the vision model yields different prediction types, such as localization or segmentation.

Abstract

Referring Expression Comprehension (REC) aims to achieve fine-grained cross-modal content alignment. The traditional two-stage approaches, by decomposing REC into localization (region proposal) and comprehension (expression-based ranking), lead to the isolation of continuous image information and heavily rely on the quality of the proposals. In this paper, we propose a point-based two-stage framework for REC to quickly achieve localization by inserting a language-modulated auto-focus module into the locked vision model. Specifically, we redefine REC as two processes: point-based cross-modal comprehension and point-based instance localization. For the comprehension stage, we reconstruct the raw annotations into soft masks at the feature point level as a metric of cross-modal correlation. With this indirect metric, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions. Remarkably, soft masks are shape-independent, which means our method is extremely general. By switching different vision models, different types of predictions (e.g., localization and segmentation) can be obtained. Experiments on multiple benchmarks demonstrate the feasibility and potential of our point-based paradigm. Our code will be public at https://github.com/VILAN-Lab/PBREC-AF.

Authors

Shiyi Zheng

Peizhi Zhao

Qingbao Huang

Journals

ACM Transactions on Multimedia Computing Communications and Applications

Institutions

The University of Adelaide

Guangxi University

Communication University of China
