October 17, 2023 | Open Access

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Abstract

We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM and SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is publicly available at: https://github.com/microsoft/SoM.
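To make the marking step concrete, the following is a minimal sketch of how regions might be numbered and given a mark position before being overlaid on the image. It is not the code from the SoM repository: the function names (`mark_position`, `assign_marks`, `build_prompt`), the centroid-based placement rule, and the prompt wording are all assumptions made for illustration. The segmentation masks are assumed to have already been produced by a model such as SEEM or SAM and are represented here as plain nested lists of booleans.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: mask[row][col] is True where the region covers the pixel.
Mask = List[List[bool]]


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return a (row, col) pixel inside the region at which to draw its mark."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not pixels:
        raise ValueError("empty mask")
    mean_r = sum(r for r, _ in pixels) / len(pixels)
    mean_c = sum(c for _, c in pixels) / len(pixels)
    # The centroid of a non-convex region can fall outside the region, so snap
    # to the closest pixel that actually belongs to it (an assumed placement rule).
    return min(pixels, key=lambda p: (p[0] - mean_r) ** 2 + (p[1] - mean_c) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Number the regions 1..N and pick a drawing position for each numeric mark."""
    return {i: mark_position(m) for i, m in enumerate(masks, start=1)}


def build_prompt(question: str, num_marks: int) -> str:
    """Compose an illustrative text prompt to send alongside the marked image."""
    return (
        f"The image is annotated with numeric marks 1 to {num_marks}, "
        "one per region. Refer to regions by their mark.\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # A C-shaped region (its centroid lies in the hole) and a straight row.
    c_shape = [[True, True, True], [True, False, False], [True, True, True]]
    row = [[False, False, False], [True, True, True]]
    print(assign_marks([c_shape, row]))
    print(build_prompt("Which region contains the dog?", 2))
```

Drawing the numbers onto the image and sending the marked image to the model are omitted here; both depend on the imaging library and model API in use.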

Authors

Jianwei Yang

Xiamen University of Technology

Hao Zhang

Jiangnan University

Feng Li

Second Affiliated Hospital of Inner Mongolia Medical University
