Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
Building similarity graph...
Analyzing shared references across papers
Loading...
Jeffrey Liu
Runjie Hu
Building similarity graph...
Analyzing shared references across papers
Loading...
Liu et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68dc1e308a7d58c25ebb138d — DOI: https://doi.org/10.48550/arxiv.2509.09958
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: