We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is publicly available at: https://github.com/microsoft/SoM.
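The marking step described above can be illustrated with a minimal sketch. It is not the released SoM code: it assumes the segmentation model has already produced one binary mask per region, and the helper names (`mark_positions`, `build_prompt`), the largest-first numbering, and the prompt wording are all hypothetical choices made for illustration. It only decides which number goes where; drawing the marks onto the image is left out.

```python
from typing import Dict, List, Tuple

# A binary mask as nested lists: 1 inside the region, 0 outside.
# Assumption: masks come from a segmentation model such as SEEM/SAM.
Mask = List[List[int]]


def mark_positions(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Assign a numeric mark to each region and choose a (row, col) to draw it at."""
    regions = []
    for mask in masks:
        pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
        if pixels:  # skip empty masks so mark numbers stay consecutive
            regions.append(pixels)

    # Hypothetical ordering: number the largest region first.
    regions.sort(key=lambda px: -len(px))

    marks: Dict[int, Tuple[int, int]] = {}
    for label, pixels in enumerate(regions, start=1):
        center_r = sum(r for r, _ in pixels) / len(pixels)
        center_c = sum(c for _, c in pixels) / len(pixels)
        # The centroid of a non-convex region can fall outside it,
        # so snap to the nearest pixel that belongs to the region.
        marks[label] = min(
            pixels, key=lambda p: (p[0] - center_r) ** 2 + (p[1] - center_c) ** 2
        )
    return marks


def build_prompt(question: str, marks: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text prompt that refers to the marks (illustrative wording only)."""
    ids = ", ".join(str(k) for k in sorted(marks))
    return (
        f"The image is annotated with numeric marks {ids}. "
        f"{question} Answer with the mark number(s)."
    )
```

In use, each returned position would be passed to an image-drawing routine to render the number on the image, and the marked image together with the prompt would then be sent to the LMM.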
Jianwei Yang
Xiamen University of Technology
Hao Zhang
Jiangnan University
Feng Li
Second Affiliated Hospital of Inner Mongolia Medical University
Yang et al. studied this question.
synapsesocial.com/papers/6a16d36e0631ba25057b85f9 — DOI: https://doi.org/10.48550/arxiv.2310.11441